Finding the median of a column in a huge CSV file

Finding the median of a column in a huge CSV file - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: Finding the median of a column in a huge CSV file (/thread-39268.html)

Finding the median of a column in a huge CSV file - markagregory - Jan-24-2023

Copying the column from the CSV to a DataFrame causes RAM to run out -> 100 GB - yes it is a big file

I would appreciate some help to implement a solution working with the CSV file directly - Quite remarkably, I searched and could not find an example solution for this activity.

Any help appreciated.

RE: Finding the median of a column in a huge CSV file - Larz60+ - Jan-24-2023

Here's a blog (I didn't read the entire artcle) on how to effeciently deal with massive CSV files using pandas.
Optimized ways to Read Large CSVs in Python

More from google scholar:

Go to google scholar: https://scholar.google.com/
Query on 'python solutions for reading massive csv files (over 100 Gb)'
You will get a large list of papers on the subject.
[/list

RE: Finding the median of a column in a huge CSV file - markagregory - Jan-24-2023

(Jan-24-2023, 10:30 AM)Larz60+ Wrote: Here's a blog (I didn't read the entire artcle) on how to effeciently deal with massive CSV files using pandas.
Optimized ways to Read Large CSVs in Python

More from google scholar:
Go to google scholar: https://scholar.google.com/

Query on 'python solutions for reading massive csv files (over 100 Gb)'
You will get a large list of papers on the subject.
[/list

thx for the pointer, I saw that article and it does not help solve this issue. Getting chunks does not help, and the other library does not appear to be any better than pandas when it comes to doing median on a large dataset.

I'm hoping that someone has an example of how to work with a massive file to get the median of an unsorted column

RE: Finding the median of a column in a huge CSV file - perfringo - Jan-24-2023

It is always good to be precise while expressing yourself. Do you want help or you want that somebody writes you a code?

If you want help then show your code and define your problem- not enough memory, speed, how to calculate median etc.

RE: Finding the median of a column in a huge CSV file - rob101 - Jan-24-2023

Not too sure if this is of any help:

If we assume that this very small sample CSV file represents your very large CSV file...

Output:10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39

... and you want column four (from a zero index), this code will accumulate all the values from said column, from which you could do whatever math operation you choose.

import csv
values = []
column = 4
with open('data.csv') as data:
    reader = csv.reader(data)
    for row in reader:
        values.append(int(row[column]))

print(values)
print(sum(values))

Output:[14, 24, 34]
72

RE: Finding the median of a column in a huge CSV file - DeaD_EyE - Jan-24-2023

import csv
from operator import itemgetter


def get_avg(file, column, data_type, skip_header):
    
    getter = itemgetter(column)
    
    total = 0
    count = 0
    
    with open(file, encoding="utf8", newline="") as fd:
        if skip_header:
            next(fd)
            
        reader = csv.reader(fd)
        for value in map(getter, reader):
            total += data_type(value)
            count += 1
            
    return total / count


# read your.csv, skipping the first line, getting the 3rd column, converting it to int.
avg1 = get_avg("your.csv", 2, int, True)

This does only work with valid data. If for example at any place the 3rd column is not an integer, then an ValueError is thrown. So before you start, test it with a smaller dataset and keep in mind, that data is often dirty.

If the data is dirty, you could skip invalid fields.

import csv
from operator import itemgetter


def get_avg(file, column, data_type, skip_header):
    
    getter = itemgetter(column)
    
    total = 0
    count = 0
    
    with open(file, encoding="utf8", newline="") as fd:
        if skip_header:
            next(fd)
            
        reader = csv.reader(fd)
        # usually counting lines starts with 1
        # if the header is skipped, then the start is line 2
        for line, value in enumerate(map(getter, reader), start=1 + skip_header):
            try:
                total += data_type(value)
            except ValueError:
                # do not count invalid data
                print(f"[{line}]: '{value}'")
                continue

            count += 1
            
    return total / count


# read your.csv, skipping the first line, getting the 3rd column, converting it to float.
avg2 = get_avg("your.csv", 2, float, True)