Posts: 2
Threads: 1
Joined: Jan 2023
Copying the column from the CSV into a DataFrame causes RAM to run out - the file is about 100 GB, so yes, it is big.
I would appreciate some help implementing a solution that works with the CSV file directly. Remarkably, I searched and could not find an example solution for this task.
Any help appreciated.
Posts: 12,022
Threads: 484
Joined: Sep 2016
Here's a blog post (I didn't read the entire article) on how to efficiently deal with massive CSV files using pandas.
Optimized ways to Read Large CSVs in Python
More from Google Scholar:
- Go to Google Scholar: https://scholar.google.com/
- Query 'python solutions for reading massive csv files (over 100 Gb)'
You will get a large list of papers on the subject.
Posts: 2
Threads: 1
Joined: Jan 2023
(Jan-24-2023, 10:30 AM)Larz60+ Wrote: Here's a blog post (I didn't read the entire article) on how to efficiently deal with massive CSV files using pandas.
Optimized ways to Read Large CSVs in Python
More from Google Scholar:
- Go to Google Scholar: https://scholar.google.com/
- Query 'python solutions for reading massive csv files (over 100 Gb)'
You will get a large list of papers on the subject.
thx for the pointer - I saw that article, and it does not help solve this issue. Reading in chunks does not help, and the other library does not appear to be any better than pandas when it comes to computing a median on a large dataset.
I'm hoping that someone has an example of how to work with a massive file to get the median of an unsorted column.
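For the record, here is a minimal sketch of one way to do that when the column happens to be integer-valued (the function name and file layout are my own, not from any library): build a value-to-count histogram with collections.Counter in a single streaming pass, then read the median off the histogram. Memory use is proportional to the number of distinct values, not the number of rows, so a 100 GB file whose column has a bounded range (ages, counts, prices in cents, ...) fits comfortably in RAM.

```python
import csv
from collections import Counter


def streaming_median_int(path, column, skip_header=True):
    """Exact median of an integer CSV column in a single pass,
    using a value -> count histogram instead of storing all rows."""
    counts = Counter()
    with open(path, newline="") as fd:
        reader = csv.reader(fd)
        if skip_header:
            next(reader)
        for row in reader:
            counts[int(row[column])] += 1

    def kth(k):
        # k-th smallest value (0-indexed), read off the histogram
        seen = 0
        for value in sorted(counts):
            seen += counts[value]
            if seen > k:
                return value

    n = sum(counts.values())
    if n % 2:
        return kth(n // 2)
    return (kth(n // 2 - 1) + kth(n // 2)) / 2
```

This does not work if the column is floating point with a huge number of distinct values; in that case the Counter itself can grow too large.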
Posts: 1,950
Threads: 8
Joined: Jun 2018
It is always good to be precise when expressing yourself. Do you want help, or do you want somebody to write the code for you?
If you want help, then show your code and define your problem: not enough memory, speed, how to calculate the median, etc.
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy
Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Posts: 453
Threads: 16
Joined: Jun 2022
Not too sure if this is of any help:
If we assume that this very small sample CSV file represents your very large CSV file...
Output: 10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
... and you want column four (zero-indexed), this code will accumulate all the values from that column, from which you can do whatever math operation you choose.
import csv

values = []
column = 4

with open('data.csv') as data:
    reader = csv.reader(data)
    for row in reader:
        values.append(int(row[column]))

print(values)
print(sum(values))
Output: [14, 24, 34]
72
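One caveat on the snippet above: accumulating every value in a Python list still keeps the whole column in RAM, which is exactly what fails on a 100 GB file. A minimal sketch (function name is mine, assuming the same sample CSV layout) that keeps only running aggregates instead, so memory stays constant regardless of file size:

```python
import csv


def column_stats(path, column):
    """One pass over the file, tracking running aggregates only."""
    total = count = 0
    minimum = maximum = None
    with open(path, newline="") as data:
        for row in csv.reader(data):
            value = int(row[column])
            total += value
            count += 1
            minimum = value if minimum is None else min(minimum, value)
            maximum = value if maximum is None else max(maximum, value)
    return {"count": count, "sum": total, "mean": total / count,
            "min": minimum, "max": maximum}
```

Sum, count, mean, min, and max all stream this way; the median does not, because it depends on the ordering of the whole column.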
Sig:
>>> import this
The UNIX philosophy: "Do one thing, and do it well."
"The danger of computers becoming like humans is not as great as the danger of humans becoming like computers." :~ Konrad Zuse
"Everything should be made as simple as possible, but not simpler." :~ Albert Einstein
Posts: 2,120
Threads: 10
Joined: May 2017
Jan-24-2023, 04:22 PM
(This post was last modified: Jan-24-2023, 04:23 PM by DeaD_EyE.)
import csv
from operator import itemgetter


def get_avg(file, column, data_type, skip_header):
    getter = itemgetter(column)
    total = 0
    count = 0
    with open(file, encoding="utf8", newline="") as fd:
        if skip_header:
            next(fd)
        reader = csv.reader(fd)
        for value in map(getter, reader):
            total += data_type(value)
            count += 1
    return total / count


avg1 = get_avg("your.csv", 2, int, True)
This only works with valid data. If, for example, the 3rd column is not an integer at any point, a ValueError is raised. So before you start, test it with a smaller dataset, and keep in mind that data is often dirty.
If the data is dirty, you could skip invalid fields.
import csv
from operator import itemgetter


def get_avg(file, column, data_type, skip_header):
    getter = itemgetter(column)
    total = 0
    count = 0
    with open(file, encoding="utf8", newline="") as fd:
        if skip_header:
            next(fd)
        reader = csv.reader(fd)
        for line, value in enumerate(map(getter, reader), start=1 + skip_header):
            try:
                total += data_type(value)
            except ValueError:
                print(f"[{line}]: '{value}'")
                continue
            count += 1
    return total / count


avg2 = get_avg("your.csv", 2, float, True)
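The running average above still does not answer the thread's actual question, the median of an unsorted column, and a simple in-memory histogram can blow up when a float column has billions of distinct values. Here is a sketch (all names are mine; requires Python >= 3.9 for math.nextafter) of an exact, constant-memory median that bisects on the value range. Each bisection step is one full pass over the file, so expect on the order of 60 passes worst case for float data - slow on 100 GB, but it never runs out of RAM:

```python
import csv
import math


def _scan(path, column, mid, skip_header):
    """One pass: count values below `mid`, and find the smallest
    actual data value >= `mid` together with its multiplicity."""
    below, ceil_val, ceil_count = 0, None, 0
    with open(path, newline="") as fd:
        reader = csv.reader(fd)
        if skip_header:
            next(reader)
        for row in reader:
            v = float(row[column])
            if v < mid:
                below += 1
            elif ceil_val is None or v < ceil_val:
                ceil_val, ceil_count = v, 1
            elif v == ceil_val:
                ceil_count += 1
    return below, ceil_val, ceil_count


def kth_smallest(path, column, k, skip_header=True):
    """0-indexed k-th smallest value of a numeric CSV column,
    bisecting on the value range with one file pass per step."""
    lo = hi = None
    with open(path, newline="") as fd:
        reader = csv.reader(fd)
        if skip_header:
            next(reader)
        for row in reader:
            v = float(row[column])
            lo = v if lo is None else min(lo, v)
            hi = v if hi is None else max(hi, v)

    while True:
        mid = (lo + hi) / 2
        below, ceil_val, ceil_count = _scan(path, column, mid, skip_header)
        if k < below:
            hi = mid                 # answer lies strictly below mid
        elif k < below + ceil_count:
            return ceil_val          # rank k falls on this data value
        else:
            # answer lies strictly above ceil_val
            lo = math.nextafter(ceil_val, math.inf)


def median(path, column, skip_header=True):
    """Exact median of an unsorted numeric column, constant memory."""
    with open(path, newline="") as fd:
        reader = csv.reader(fd)
        if skip_header:
            next(reader)
        n = sum(1 for _ in reader)
    if n % 2:
        return kth_smallest(path, column, n // 2, skip_header)
    return (kth_smallest(path, column, n // 2 - 1, skip_header)
            + kth_smallest(path, column, n // 2, skip_header)) / 2
```

If an approximate median is acceptable, a fixed-bin histogram over one pass is far cheaper; the bisection above is only worth it when the answer must be exact.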