Python Forum

need algorithm to strip non-ascii characters from LONG csv file
The Canadian Census offers some interesting csv files that are legally free to download and can lead to some fun data analysis. Sadly, these files appear to be crammed with Unicode characters used as padding in various cells without contributing to the data in any way. Is there a reasonably efficient algorithm to strip the non-ASCII characters from csv files that contain millions of characters?

EDIT: the longest of these files is more than 193,000 KB (roughly 190 MB), and here's the link to all of them: http://www12.statcan.gc.ca/census-recens...cfm?Lang=E
An example? I have downloaded one of these and didn't see any unusual characters.
Same here. What are you using to read the files where you see the corruption?
Sorry, but it was a minor stupid mistake in my code. There are no unusual characters in those files. Maybe this post could be deleted or archived. I apologize for wasting everyone's time.