 access a very large file? As an array or as a dataframe?
#1
Good morning. I have just found this forum and I am very glad I did.
I have a project on unsupervised machine learning, so I mainly use Python packages and tools such as numpy, scipy and pandas.
However, I have been stuck for a long time at a key point: how to access a very large file without crashing my system.
Let me explain.
The initial csv file contains some data, and I read it with loadtxt as a 4,664,604 x 3 array.

B_init = np.loadtxt(open("out.munmun_twitterex_ut"), skiprows=1, usecols=(0,1,2)).astype(int)
The next step has given me a lot of trouble. I want to keep in a second array only the rows that fulfill some conditions.
I have tried various approaches. The latest one is to create two separate lists from the first and second columns (the two columns I am interested in), reduce the number of elements, and rebuild the new array. But my system crashes!
I have read that pandas dataframes make access easier, but again, it is a very large file.


# the second column is a list with the users
[user_list.append(pairs[1]) for pairs in A]

# use a Counter to count how many times each user appears
user_counter1 = Counter(user_list)
#print user_counter1, "\n"

# the first column is a list with the tags
[tags_list.append(pairs[0]) for pairs in A]
#print np.unique(tags_list)

# use a Counter to count how many times each tag appears
tags_counter1 = Counter(tags_list)
#print tags_counter1,"\n"

# keep the users with value > 80, i.e. the users who have more than 80 tags
[final_users1.append(key) for key,value in user_counter1.iteritems() if value > 80]
#print len(final_users1)

# create the desired table. The system is crashing at this point!
df = pd.DataFrame(index=np.unique(tags_list),columns=final_users1)
for s in A:
    df.loc[s[0],s[1]] = 1

df = df.fillna(0)

C = df.as_matrix()

Is there any way to access and compare the elements of an array of this size? I am very confused.

I don't know what else to try.

Thank you anyway,
Angelika
#2
(Replying to Angelika's post #1 of May-16-2017, 11:53 AM.)

Can you elaborate on "The system is crashing at this point!"? A Python error? Blue screen of death? System freeze? Smoke?
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
#3
You are right, I didn't clarify what happens.
The system freezes and I have to restart the PC to use it again.
#4
Freezing with the hard disk going wild? You may be exceeding your RAM, so the system starts swapping. That is usually not a good thing on PCs. Otherwise your CPU just gets very busy.

Another possibility is thermal overload. Not that many PCs can withstand their CPU going full blast for more than a minute. There are programs to test this.

A good thing to do is to run a CPU/memory/disk I/O monitor while the program runs (on Windows, one of the tabs of the process monitor), or to start with smaller data sets and increase their size progressively.
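
The "start small and grow" advice can be sketched with the standard-library tracemalloc module, which reports the peak memory allocated by Python code. This is only an illustration written for Python 3: peak_bytes and build_table are hypothetical names, and build_table is a stand-in workload, not the code from this thread.

```python
import tracemalloc


def peak_bytes(func, *args):
    """Run func(*args) and return the peak memory (bytes) traced while it ran."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def build_table(n_rows, n_cols=100):
    """Hypothetical stand-in for the real workload: a dense n_rows x n_cols table."""
    return [[0] * n_cols for _ in range(n_rows)]


if __name__ == "__main__":
    # Grow the input step by step and watch how the peak scales
    # before trying the full data set.
    for n_rows in (1_000, 5_000, 20_000):
        print(n_rows, "rows ->", peak_bytes(build_table, n_rows), "bytes at peak")
```

If the peak grows much faster than the input, the full data set is likely to exhaust RAM long before the loop finishes.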
#5
Probably running out of memory, as Ofnuts pointed out.

Your initial file is not very large: an int array with shape (4664604, 3) needs only about 110MB of memory. It is not clear what "A" is; I guess it is an array consisting of the first two columns of B_init? It seems that you are trying to create a rather big dataframe with len(np.unique(tags_list)) rows and len(final_users1) columns. That can easily be something like 100000 rows and 10000 columns (depending on your data). You are using only a few million values, so perhaps you could use something more memory-efficient, such as the sparse matrices from scipy.sparse?
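
The arithmetic behind these estimates can be checked in a few lines. This sketch assumes 8 bytes per value (64-bit integers or floats); the 100000 x 10000 shape is only the guess made above, not a measured figure, and dense_bytes is a hypothetical helper name.

```python
BYTES_PER_VALUE = 8  # assumption: 64-bit integers or floats


def dense_bytes(n_rows, n_cols, bytes_per_value=BYTES_PER_VALUE):
    """Memory needed by a dense n_rows x n_cols table of fixed-size values."""
    return n_rows * n_cols * bytes_per_value


input_bytes = dense_bytes(4_664_604, 3)     # the array loaded from the file
table_bytes = dense_bytes(100_000, 10_000)  # the guessed tag x user table

print(f"input array: {input_bytes / 1e6:.0f} MB")  # input array: 112 MB
print(f"dense table: {table_bytes / 1e9:.0f} GB")  # dense table: 8 GB
```

So the input fits comfortably in memory, while a dense table of the guessed size would need roughly 70 times more, almost all of it spent on zeros.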

Also, the loop
for s in A:
    df.loc[s[0],s[1]] = 1
will create new columns for users (s[1] values) that are not in final_users1, so you should either check s[1] against final_users1 or abandon your > 80 filter.
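
Both points (filter the rows first, then store the table sparsely) might be combined as in the sketch below. It assumes that A holds the (tag, user) columns as guessed above, uses toy data, and replaces the > 80 threshold with a small min_tags value so the result is easy to inspect; names such as final_users and min_tags are illustrative.

```python
import numpy as np
from scipy import sparse

# Toy (tag, user) pairs standing in for the first two columns of B_init.
A = np.array([[10, 1], [11, 1], [10, 2], [12, 1], [10, 1]])
min_tags = 2  # toy threshold; the thread keeps users with more than 80 rows

# Count rows per user and keep only the users above the threshold.
users, counts = np.unique(A[:, 1], return_counts=True)
final_users = users[counts > min_tags]

# Drop every row whose user was filtered out.
kept = A[np.isin(A[:, 1], final_users)]

# Map the remaining tags and users to consecutive row/column indices.
tags, row_idx = np.unique(kept[:, 0], return_inverse=True)
cols, col_idx = np.unique(kept[:, 1], return_inverse=True)

# Build the sparse 0/1 table; only the nonzero cells are stored.
data = np.ones(len(kept), dtype=np.int8)
C = sparse.coo_matrix((data, (row_idx, col_idx)),
                      shape=(len(tags), len(cols))).tocsr()
C.data[:] = 1  # repeated (tag, user) pairs are summed on conversion; reset to 1

print(C.shape, C.nnz)  # (3, 1) 3
```

Here tags[i] and cols[j] give the tag and user behind cell (i, j), which replaces the dataframe's row and column labels.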

And using a list comprehension such as
[user_list.append(pairs[1]) for pairs in A]
for its side effect, just to save one line compared to a for loop, is extremely ugly and should be avoided ...
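
For comparison, here are some alternatives that avoid the side-effect comprehension. The data is a toy stand-in for A, and the code is written for Python 3.

```python
from collections import Counter

# Toy (tag, user) pairs standing in for the rows of A.
A = [(10, 1), (11, 1), (10, 2), (12, 1)]

# Plain for loop: the appends are the point, so spell them out.
user_list = []
for pairs in A:
    user_list.append(pairs[1])

# Or build the list directly; here the comprehension's value is actually used.
user_list = [pairs[1] for pairs in A]

# Counter accepts any iterable, so the intermediate list is optional.
user_counter = Counter(pairs[1] for pairs in A)
print(user_counter)  # Counter({1: 3, 2: 1})
```

The generator form also avoids holding a second multi-million-element list in memory just to count it.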
#6
Thank you very much!
You helped me a lot!!!!
I think I found the problem!!!!