Python Forum
access a very large file? As an array or as a dataframe?
Thread Rating: 2 Vote(s) - 3.5 Average
#1
Good morning. I have just found this forum and I am really very glad.
I have a project to do which relates to unsupervised machine learning, so I primarily use Python packages and tools such as numpy, scipy, and pandas.
However, for a long time I've been stuck at a key point: how to access a very large file without crashing my system.
Let me explain.
The initial csv file contains some data, and I read it with loadtxt as a 4,664,604 x 3 array.

B_init = np.loadtxt(open("out.munmun_twitterex_ut"), skiprows=1, usecols=(0,1,2)).astype(int)
The next step has given me a lot of difficulty. I want to keep in a second array only those rows which fulfill some conditions.
I have thought of various ways. The last one is to create two separate lists from the first and second column respectively (I am interested in these two columns), to reduce the number of elements, and to reconstruct the new array. But my system is crashing!
I read that pandas DataFrames make access easier, but again, it's a very large file.

#the second column is a list with the users

[user_list.append(pairs[1]) for pairs in A]

#use a Counter from collections to count how many times each user appears
user_counter1 = Counter(user_list)
#print user_counter1, "\n"

#the first column is a list with the tags
[tags_list.append(pairs[0]) for pairs in A]
#print np.unique(tags_list)

#use a Counter to count how many times each tag appears
tags_counter1 = Counter(tags_list)
#print tags_counter1,"\n"

#keep the users with value > 80, i.e. the users who have more than 80 tags
[final_users1.append(key) for key,value in user_counter1.iteritems() if value > 80]
#print len(final_users1)

#create the desired table. The system is crashing at this point!
df = pd.DataFrame(index=np.unique(tags_list),columns=final_users1)
for s in A:
    df.loc[s[0],s[1]] = 1

df = df.fillna(0)

C = df.as_matrix()

Does anyone have an idea how I can access and compare the elements of an array of this size? I am very confused.

I don't know what else I can think of or do.

Thank you anyway,
Angelika
#2
(May-16-2017, 11:53 AM)Angelika Wrote: Good morning. I have just found this forum and I am really very glad. [...]

Can you elaborate on "The system is crashing at this point!"? Python error? Blue screen of death? System freeze? Smoke?
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
#3
You are right, I didn't clarify what happens.
The system freezes and I have to restart the PC to use it again.
#4
Freezing with the hard disk going wild? You may be exceeding your RAM, and the system starts swapping. Usually not a good thing on PCs. Otherwise your CPU just gets very busy.

Another possibility is thermal overload. Not many PCs are able to withstand their CPU going full blast for more than a minute. There are programs to test this.

A good thing to do is to run a CPU/memory/disk I/O monitor while you run the program (on Windows, one of the tabs of the process monitor), or start with smaller data sets and increase their size progressively.
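If you want a rough check from inside the script itself, a minimal sketch using the standard-library tracemalloc module could look like this (note: tracemalloc tracks Python-level allocations only, not the total process footprint, so treat the numbers as a lower bound):

```python
import tracemalloc

tracemalloc.start()

# do the memory-hungry work here; a toy allocation stands in for it
data = [i for i in range(10 ** 5)]

current, peak = tracemalloc.get_traced_memory()
print("current: %.1f MB, peak: %.1f MB" % (current / 1e6, peak / 1e6))
tracemalloc.stop()
```

Running it before and after each step of your pipeline tells you which step blows up.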
#5
Probably running out of memory, as Ofnuts pointed out.

Your initial file is not very large; an int array with shape (4664604, 3) needs only about 110 MB of memory. It is not clear what "A" is - I guess it is an array consisting of the first two columns of B_init? It seems that you are trying to create a rather big dataframe with np.unique(tags_list) rows and len(final_users1) columns - it can easily be something like 100,000 rows and 10,000 columns (it depends on your data). You are using only a few million values, so perhaps you can use something more memory efficient, like sparse matrices from scipy.sparse?

for s in A:
    df.loc[s[0],s[1]] = 1
will create new columns for users (s[1] values) not in final_users1, so you should either check s[1] against final_users1 or abandon your > 80 filter.
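For the first option, converting final_users1 to a set before the loop keeps the membership test cheap (a sketch with made-up data):

```python
final_users1 = [2, 5]
keep = set(final_users1)  # O(1) lookups instead of O(n) list scans

A = [[10, 1], [10, 2], [11, 5], [12, 3]]
filtered = [s for s in A if s[1] in keep]
print(filtered)  # [[10, 2], [11, 5]]
```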

And using a list comprehension such as
[user_list.append(pairs[1]) for pairs in A]
for its side effect, just to save one line compared to a for loop, is extremely ugly and should be avoided ...
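The idiomatic alternatives - either a plain loop, or a comprehension used for its value rather than its side effect:

```python
A = [[10, 1], [10, 2], [11, 5]]

# plain loop with an explicit side effect
user_list = []
for pairs in A:
    user_list.append(pairs[1])

# or, better, a comprehension used for its *value*
user_list = [pairs[1] for pairs in A]
print(user_list)  # [1, 2, 5]
```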
Reply
#6
Thank you very much!
You helped me a lot!!!!
I think that I found the problem!!!!