Python Forum

Finding the median of a column in a huge CSV file
#1
Copying the column from the CSV into a DataFrame causes RAM to run out; the file is 100 GB, so yes, it is a big file.

I would appreciate some help implementing a solution that works with the CSV file directly. Remarkably, I searched and could not find an example solution for this task.

Any help appreciated.
#2
Here's a blog (I didn't read the entire article) on how to efficiently deal with massive CSV files using pandas; a small chunked-read sketch follows below.
Optimized ways to Read Large CSVs in Python

More from google scholar:
  • Go to google scholar: https://scholar.google.com/
  • Query on 'python solutions for reading massive csv files (over 100 Gb)'
    You will get a large list of papers on the subject.
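For reference, a minimal sketch of the chunked-reading idea (the file name big.csv and the column name value are made up for illustration): parse the file piece by piece, keep only the one column, and take the median at the end. Whether that single column then fits in RAM is a separate question.

import pandas as pd

# Read the huge CSV in chunks so only one piece is parsed at a time,
# keeping just the column of interest from each chunk.
pieces = []
for chunk in pd.read_csv("big.csv", usecols=["value"], chunksize=1_000_000):
    pieces.append(chunk["value"])

# Concatenate the per-chunk columns and compute the median of the full column.
column = pd.concat(pieces, ignore_index=True)
print(column.median())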
#3
(Jan-24-2023, 10:30 AM)Larz60+ Wrote: Here's a blog (I didn't read the entire article) on how to efficiently deal with massive CSV files using pandas.
Optimized ways to Read Large CSVs in Python

More from google scholar:
  • Go to google scholar: https://scholar.google.com/
  • Query on 'python solutions for reading massive csv files (over 100 Gb)'
    You will get a large list of papers on the subject.

Thanks for the pointer. I saw that article, and it does not help solve this issue: reading in chunks does not help, because the median needs the ordering of the whole column rather than per-chunk summaries, and the other library does not appear to be any better than pandas when it comes to computing a median on a large dataset.

I'm hoping that someone has an example of how to work with a massive file to get the median of an unsorted column.
#4
It is always good to be precise when expressing yourself. Do you want help, or do you want somebody to write the code for you?

If you want help, then show your code and define your problem: not enough memory, speed, how to calculate the median, etc.
#5
Not too sure if this is of any help:

If we assume that this very small sample CSV file represents your very large CSV file...

Output:
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
... and you want column four (zero-indexed), this code will accumulate all the values from that column, from which you could do whatever math operation you choose (a median example follows the output below).

import csv
values = []
column = 4
with open('data.csv') as data:
    reader = csv.reader(data)
    for row in reader:
        values.append(int(row[column]))

print(values)
print(sum(values))
Output:
[14, 24, 34]
72
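From the accumulated values you could then take the median with the standard library. A minimal self-contained sketch along the same lines (same hypothetical data.csv and column index as above); note that it still needs the whole column, plus the sorted copy that statistics.median makes internally, to fit in memory:

import csv
import statistics

column = 4
values = []

with open('data.csv', newline='') as data:
    reader = csv.reader(data)
    for row in reader:
        # keep only the one column of interest, not the whole row
        values.append(float(row[column]))

# statistics.median handles unsorted data; it sorts a copy internally
print(statistics.median(values))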
#6
import csv
from operator import itemgetter


def get_avg(file, column, data_type, skip_header):
    
    getter = itemgetter(column)
    
    total = 0
    count = 0
    
    with open(file, encoding="utf8", newline="") as fd:
        if skip_header:
            next(fd)
            
        reader = csv.reader(fd)
        for value in map(getter, reader):
            total += data_type(value)
            count += 1
            
    return total / count


# read your.csv, skipping the first line, getting the 3rd column, converting it to int.
avg1 = get_avg("your.csv", 2, int, True)
This only works with valid data. If, for example, the 3rd column is not an integer at some point, then a ValueError is raised. So before you start, test it with a smaller dataset and keep in mind that data is often dirty.

If the data is dirty, you could skip invalid fields.


import csv
from operator import itemgetter


def get_avg(file, column, data_type, skip_header):
    
    getter = itemgetter(column)
    
    total = 0
    count = 0
    
    with open(file, encoding="utf8", newline="") as fd:
        if skip_header:
            next(fd)
            
        reader = csv.reader(fd)
        # usually counting lines starts with 1
        # if the header is skipped, then the start is line 2
        for line, value in enumerate(map(getter, reader), start=1 + skip_header):
            try:
                total += data_type(value)
            except ValueError:
                # do not count invalid data
                print(f"[{line}]: '{value}'")
                continue

            count += 1
            
    return total / count


# read your.csv, skipping the first line, getting the 3rd column, converting it to float.
avg2 = get_avg("your.csv", 2, float, True)
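The same pattern could be adapted to the median the thread is actually after, assuming the single column converted to floats fits in memory even though the whole 100 GB file does not. A sketch, not a tested solution:

import csv
import statistics
from operator import itemgetter


def get_median(file, column, data_type, skip_header):

    getter = itemgetter(column)
    values = []

    with open(file, encoding="utf8", newline="") as fd:
        if skip_header:
            next(fd)

        reader = csv.reader(fd)
        for value in map(getter, reader):
            try:
                # keep only the one column; invalid fields are skipped
                values.append(data_type(value))
            except ValueError:
                continue

    # statistics.median works on unsorted data; it sorts a copy internally
    return statistics.median(values)


# read your.csv, skipping the first line, getting the 3rd column, converting it to float.
median1 = get_median("your.csv", 2, float, True)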

