Reading floats and ints from csv-like file

Krookroo · (This post was last modified: Sep-05-2017, 12:48 PM by Krookroo.)

@DeaD_EyE:
You think there is more to it than the fact that the file is written with mixed encodings making it ultra hard to process via python?
I spent one hour after my last post trying to force encodings and decodings, both in python and in the original program, with no success (the program that writes the file has an option to decide the encoding in which it opens the file to write in... which doesn't work and does nothing).

Quote:You can ship around this problem, if you open the file with the right encoding.

with open(filename, encoding='utf-8-sig') as csvfile:

This didn't work for me, because the first byte is not compatible with that encoding so it raised an error. All the encoding/decoding was failing on the first byte of the file.

---------------------------------------------------------------------------------

Quote:Well, just strip() these ... things:
>>> "��1.12005000,1.11800000,14574".strip("�")
'1.12005000,1.11800000,14574'

How would I know these things will be the same for the other files that my script will generate (to be more precise: they're not)? If I have to manually adapt the stripping to each file, I'll just manually copy/paste the data into another file that I save in the right encoding. And if I know these will always be the first two characters, I think my solution of working on the first line with

a = file.readline()
a = a[2:]

is more general. Even then, it seems it will always be the first two characters, but I can't be sure of that.
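One way to check rather than guess is to look at the raw bytes at the start of each file. The sketch below assumes the two stray characters are a byte order mark (BOM), which nothing in this thread confirms. `sniff_encoding` and `read_lines` are hypothetical helper names, and the `utf-8` fallback is also an assumption.

```python
import codecs

# Known BOMs, longest first, so a UTF-32 LE mark (FF FE 00 00) is not
# mistaken for a UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(raw, default="utf-8"):
    """Return a codec name guessed from a leading BOM, or `default` if none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


def read_lines(path):
    """Read a text file, choosing the codec from its first four bytes."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    # The 'utf-16', 'utf-32' and 'utf-8-sig' codecs consume the BOM themselves.
    with open(path, encoding=sniff_encoding(head)) as fh:
        return [line.rstrip("\r\n") for line in fh]
```

If the file has no BOM at all, this falls back to the default and the decoding problem is something else, so it is only a first diagnostic step.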

**Larz60+** · Sep-05-2017, 01:05 PM

Using strip is probably safer.
You can get the numeric value of the � character by using ord(character).
This one has a value of 65533 decimal, or 0xFFFD hex.
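A quick way to confirm what the character is, as a small sketch using only the standard library:

```python
import unicodedata

ch = "\ufffd"            # the character shown as a black diamond question mark
code_point = ord(ch)

print(code_point)            # 65533
print(hex(code_point))       # 0xfffd
print(unicodedata.name(ch))  # REPLACEMENT CHARACTER
```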

hbknjr · Sep-05-2017, 01:15 PM

Didn't read the whole post, but the problem seems to be related to the encoding.

1- Try setting chcp 65001 in your console, which changes the code page to UTF-8. It could be that console encoding is different.

2- Try using codecs.

import codecs
with codecs.open("EU-1H.tx.txt",encoding='utf-8') as f:
....
....

DeaD_EyE · Sep-05-2017, 02:00 PM

Puh, when the whole thing is so undefined, that's not easy to solve correctly.

I have the absolute brute-force method for it:

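import string


def filter_text(text, allowed=string.digits + '.,'):
    # Keep only digits, dots and commas; everything else is dropped.
    return ''.join(filter(lambda c: c in allowed, text))


def process_line(line):
    # Skip empty fields so blank lines do not raise ValueError in float().
    return [float(field) for field in filter_text(line).split(',') if field]


if __name__ == '__main__':
    # errors='replace' turns undecodable bytes into U+FFFD instead of raising.
    with open('test.csv', errors='replace') as fd:
        for line in fd:
            print(process_line(line))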

Test this against your data. With this code it is even possible to parse the data when ordinary characters sit between the numbers.
You should call it brute_force_reader.py
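One limitation of the character filter is that it also throws away minus signs and exponents. A regex-based variant is sketched below as an alternative, not as the code from this post. `extract_numbers` is a hypothetical name, and the pattern assumes plain decimal notation with an optional sign and exponent.

```python
import re

# Optional sign, digits, optional fraction, optional exponent.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def extract_numbers(line):
    """Return every number found in `line` as a float, ignoring other text."""
    return [float(token) for token in NUMBER.findall(line)]
```

Unlike the filter, this treats junk between digits as a separator: `1a2` yields two numbers here, while the filter would join it into `12`.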

***sparkz_alot*** · Sep-05-2017, 02:12 PM

As of Unicode 10, the black diamond question mark (hex fffd) is used "to replace an unknown, unrecognized or unrepresentable character". Trying to decode it as utf-8 will fail, because utf-8 has already said it doesn't know what it is. One possible cause is that the characters do not lie in the Basic Multilingual Plane of 65,536 code points. Might we ask what language the original file is written in?

It seems you should be able to test for those characters, and if they exist, split them out, if they don't, proceed as normal.
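That test could look something like the sketch below. It assumes the stray characters have already been decoded to U+FFFD (for example by opening the file with errors='replace'). `clean_line`, `to_number` and `parse_line` are hypothetical names.

```python
REPLACEMENT = "\ufffd"


def clean_line(line):
    """Drop replacement characters if present, then trim whitespace."""
    if REPLACEMENT in line:
        line = line.replace(REPLACEMENT, "")
    return line.strip()


def to_number(field):
    """Return an int when the field is a whole number, otherwise a float."""
    try:
        return int(field)
    except ValueError:
        return float(field)


def parse_line(line):
    """Split a cleaned comma-separated line into ints and floats."""
    return [to_number(field) for field in clean_line(line).split(",")]


print(parse_line("\ufffd\ufffd1.12005000,1.11800000,14574\n"))
# [1.12005, 1.118, 14574]
```

Lines without the replacement character pass through unchanged, so the same function can be used on every line of the file.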

Krookroo · Sep-05-2017, 03:58 PM

Thanks for your inputs guys, lots of leads, I'll investigate into all this tomorrow
