Python Forum

access a very large file? As an array or as a dataframe?
Good morning. I have just found this forum and I am really very glad.
I have a project on unsupervised machine learning, so I mainly use Python packages and tools such as numpy, scipy, and pandas.
However, I have been stuck for a long time on a key point: how to access a very large file without crashing my system.
Let me explain.
The initial CSV file contains some data, and I read it with loadtxt as a 4,664,604 x 3 array.

B_init = np.loadtxt(open("out.munmun_twitterex_ut"), skiprows=1, usecols=(0,1,2)).astype(int);
The next step has given me a lot of difficulty. I want to keep in a second array only those rows which fulfill some conditions.
I have considered various approaches. The latest is to create two separate lists from the first and second columns (the two columns I am interested in), reduce the number of elements, and reconstruct the new array. But my system crashes!
I have read that pandas dataframes make access easier, but again, it is a very large file.

#the second column is a list with the users

[user_list.append(pairs[1]) for pairs in A]

#create a Counter to count how many times each user appears
user_counter1 = Counter(user_list)
#print user_counter1, "\n"

#the first column is a list with the tags
[tags_list.append(pairs[0]) for pairs in A]
#print np.unique(tags_list)

#create a Counter to count how many times each tag appears
tags_counter1 = Counter(tags_list)
#print tags_counter1,"\n"

#keep the users with value > 80, i.e. the users who have more than 80 tags
[final_users1.append(key) for key,value in user_counter1.iteritems() if value > 80]
#print len(final_users1)

#create the desired table. The system is crashing at this point!
df = pd.DataFrame(index=np.unique(tags_list),columns=final_users1)
for s in A:
    df.loc[s[0],s[1]] = 1

df = df.fillna(0)

C = df.as_matrix()

Is there any idea how I can access and compare the elements of an array of this size? I am very confused.

I don't know what else I can think of or do.

Thank you anyway
Angelika
(May-16-2017, 11:53 AM) Angelika wrote: [original post quoted in full]

Can you elaborate on "The system is crashing at this point!"? Python error? Blue screen of death? System freeze? Smoke?
You are right, I didn't clarify what happens.
The system freezes and I have to restart the PC to use it again.
Freezing with the hard disk going wild? You may be exceeding your RAM and the system starts swapping. Usually not a good thing on PCs. Otherwise your CPU just gets very busy.

Another possibility is just thermal overload. Not that many PCs are able to withstand their CPU going full blast for more than a minute. There are programs to test this.

A good thing to do is to run a CPU/memory/disk I/O monitor while the program runs (on Windows, one of the tabs of the Task Manager). Or start with smaller data sets and increase their size progressively.
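To try the "start small and grow" idea from inside Python, something like the following could work. It is only a sketch: build_dense_table is a made-up stand-in for the expensive step, and tracemalloc reports memory allocated through Python's allocator, not the total footprint of the process, so a system monitor is still worth watching.

```python
import tracemalloc


def peak_memory_bytes(func, *args):
    """Run func(*args) and return the peak number of bytes traced by tracemalloc."""
    tracemalloc.start()
    try:
        func(*args)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def build_dense_table(n):
    # Hypothetical stand-in for the expensive step: an n x n table of zeros.
    return [[0] * n for _ in range(n)]


# Increase the size step by step and watch how quickly the peak grows.
for n in (100, 200, 400):
    print(n, peak_memory_bytes(build_dense_table, n))
```

If the peak roughly quadruples each time the size doubles, the full data set can be extrapolated before it ever freezes the machine.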
Probably running out of memory, as Ofnuts pointed out.

Your initial file is not very large: an int array with shape (4664604, 3) needs only about 110 MB of memory (with 8-byte integers). It is not clear what "A" is; I guess it is an array consisting of the first two columns of B_init? It seems that you are trying to create a rather big dataframe with len(np.unique(tags_list)) rows and len(final_users1) columns. It can easily be something like 100000 rows and 10000 columns (depending on your data). You are using only a few million values, so perhaps you can use something more memory-efficient, like sparse matrices from scipy.sparse?
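A minimal sketch of the scipy.sparse idea, assuming A holds integer (tag, user) pairs. The sample values and the names row_idx and col_idx are made up for illustration. A dense 100000 x 10000 table has 10^9 cells, roughly 8 GB at 8 bytes per cell, while a sparse matrix stores only the entries that are actually set.

```python
import numpy as np
from scipy import sparse

# Hypothetical (tag, user) pairs; in the thread these would come from B_init.
A = np.array([[5, 10], [7, 10], [5, 20], [9, 20], [7, 30]])

# Map the raw ids to consecutive row and column positions.
tags, row_idx = np.unique(A[:, 0], return_inverse=True)
users, col_idx = np.unique(A[:, 1], return_inverse=True)

# One stored value per pair; everything else is an implicit zero.
data = np.ones(len(A), dtype=np.int8)
C = sparse.coo_matrix(
    (data, (row_idx, col_idx)), shape=(len(tags), len(users))
).tocsr()

# Repeated pairs are summed during conversion; reset them to 0/1 flags.
C.data[:] = 1

print(C.shape, C.nnz)
```

Row i of C corresponds to tags[i] and column j to users[j], so the original ids can still be recovered.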

for s in A:
    df.loc[s[0],s[1]] = 1
will create new columns for users (s[1] values) not in final_users1, so you should either check s[1] against final_users1 or abandon your > 80 filter.
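A minimal sketch of the first option, checking s[1] against final_users1 before anything is written. The sample rows and the threshold of 2 are made up for illustration (the original post uses 80), and final_users1 is built as a set here so that each membership test stays cheap.

```python
from collections import Counter

# Hypothetical (tag, user) rows standing in for A.
A = [(1, 10), (2, 10), (3, 10), (1, 20), (2, 30), (3, 30)]
MIN_TAGS = 2  # illustrative threshold; the original post uses 80

user_counter1 = Counter(user for _tag, user in A)

# A set gives constant-time membership tests, unlike a list.
final_users1 = {user for user, n in user_counter1.items() if n > MIN_TAGS}

# Keep only rows whose user passed the filter, so no extra columns can appear.
A_filtered = [(tag, user) for tag, user in A if user in final_users1]
print(A_filtered)
```

Filtering the rows first also means the table is only ever filled with pairs that belong in it.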

And using a list comprehension such as
[user_list.append(pairs[1]) for pairs in A]
for its side effect, just to save one line compared to a for loop, is extremely ugly and should be avoided.
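For comparison, a short sketch of the alternatives, with made-up sample rows in place of A. The plain loop does the same work without building a throwaway list of None values, and Counter accepts any iterable, so the intermediate list may not be needed at all.

```python
from collections import Counter

# Hypothetical (tag, user) rows standing in for A.
A = [(1, 10), (2, 10), (1, 20)]

# Plain for loop: the side effect is explicit.
user_list = []
for pairs in A:
    user_list.append(pairs[1])

# A comprehension used for its value builds the same list directly.
user_list_2 = [pairs[1] for pairs in A]

# Counter can consume a generator, skipping the intermediate list.
user_counter1 = Counter(pairs[1] for pairs in A)
print(user_counter1)
```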
Thank you very much!
You helped me a lot!
I think I have found the problem!