Hello, I'm pretty new to Python, so please bear with me if I ask any beginner's questions.
I'm trying to work on multiple files in a folder, and my code seems to be working, but I have an encoding problem: I get "\ufeff" included in some of my printed text.
I have saved all the files in UTF-8 format and used encoding="utf-8" in my code. It always worked before.
Can anyone identify why it doesn't work, or how I could maybe alter my code (the placement of encoding="utf-8", for example)?
import os

path = "/Users/nikolajkorsgaard/Documents/Eksamenen"

for filename in os.listdir(path):
    with open(filename, "r", encoding="utf-8") as f:
        if filename.endswith(".txt"):
            for line in f:
                print(line.split())
You can 'ignore' or 'replace' characters if there are problems with the encoding. You also have some other options. It's described here:
Built-in Functions
Look for the argument "errors".
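As a hedged illustration of what the errors argument does (the byte string below is made up to simulate a file containing an invalid UTF-8 byte, it's not from your files):

```python
# Sketch: behaviour of the errors argument when decoding.
# 0xff is not valid anywhere in UTF-8, so plain .decode() would raise.
data = b"caf\xff"

# 'ignore' silently drops the undecodable byte
print(data.decode("utf-8", errors="ignore"))   # caf

# 'replace' substitutes the replacement character U+FFFD
print(data.decode("utf-8", errors="replace"))  # caf�
```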
But this won't fix broken encodings. To fix a broken encoding, you can use tools like
ftfy.
This could be an approach to fix the errors:
from pathlib import Path

import ftfy

path = Path.home() / "Documents/Eksamenen"


def read_text(file):
    # read the raw bytes of the file
    binary_data = file.read_bytes()
    try:
        # try to decode the binary data as utf8
        text = binary_data.decode()  # utf8 is implicitly the default
        print(file, 'has the right encoding')
    except UnicodeDecodeError:
        # if decoding raised a UnicodeDecodeError,
        # read the file as text, interpreting it as
        # utf8 and ignoring all errors,
        # then try to fix the broken text with ftfy,
        # a module that can repair broken encodings
        text = ftfy.fix_encoding(file.read_text(errors='ignore'))
        print(file, 'has the wrong encoding. Fixed the encoding with ftfy.')
    return text


def read_text_files(root):
    root = Path(root)
    results = []
    for text_file in root.glob('*.txt'):
        text = read_text(text_file)
        results.append(text)
    return results
If you are not interested in fixing broken encodings, you can ignore or replace the offending bytes, and then ftfy is not needed.
With your original function:
for filename in os.listdir(path):
    with open(filename, "r", encoding="utf-8", errors='ignore') as f:
        if filename.endswith(".txt"):
            for line in f:
                print(line.split())
In my example I use Path objects, because they are easier to handle.
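A quick sketch of why (the folder name here is only illustrative): a Path object carries its full location, so opening it does not depend on the current working directory the way bare names from os.listdir() do:

```python
from pathlib import Path

# illustrative folder name, not necessarily yours
root = Path.home() / "Documents" / "Eksamenen"

# each text_file is a full path, so open()/read_text() work
# regardless of the current working directory
for text_file in root.glob("*.txt"):
    print(text_file.name)

# joining and inspecting paths is also simpler than with os.path
demo = root / "example.txt"
print(demo.suffix)  # .txt
```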
Is there a simpler way to target just "\ufeff"? When I look at my printed shell output, it's only the titles of the documents that have an encoding problem.
I've removed unwanted punctuation and symbols with:
for char in '•,.?!)(][%$#&\/=;:':
    word = word.replace(char, '')
Can I do it like that in a simple way, with just a few lines of code?
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

import os
import string

path = "/Users/nikolajkorsgaard/Documents/Eksamenen"

d = {}

for filename in os.listdir(path):
    with open(filename, encoding="utf-8") as f:
        if filename.endswith(".txt"):
            for word in f:
                word = word.lower()
                for char in '•,.?!)(][%$#&\/=;:':
                    word = word.replace(char, '')
                tokens = nltk.word_tokenize(word)
                print(tokens)
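For what it's worth, "\ufeff" is the byte-order mark (BOM) that some editors write at the start of a UTF-8 file, which is why it only shows up in the first word of each document. Opening the files with encoding="utf-8-sig" instead of "utf-8" strips it automatically. A small sketch (the bytes literal stands in for a real file on disk):

```python
# \xef\xbb\xbf is the UTF-8 encoding of the BOM, U+FEFF
bom_bytes = b"\xef\xbb\xbfTitle of document"

# plain utf-8 keeps the BOM as a \ufeff character
print(repr(bom_bytes.decode("utf-8")))      # '\ufeffTitle of document'

# utf-8-sig strips the BOM on decode
print(repr(bom_bytes.decode("utf-8-sig")))  # 'Title of document'
```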
Never mind, will a moderator please delete this thread? This is an exam project, so I probably shouldn't be writing this, now that I think about it.
(Jun-05-2019, 04:05 PM)NikolajKorsgaard Wrote: [ -> ]Never mind, will a moderator please delete this thread? This is an exam project, so I probably shouldn't be writing this, now that I think about it.
That's one of the exact reasons we don't delete posts - if you shouldn't have posted, then it was academic dishonesty, and we want professors to be able to discover that.