Python Forum
sorting a strange file
#1
I have a big file I need to sort. It has 2 or more whitespace-separated tokens on each line, and the last token has 2 or more slash-separated names. I need to sort the lines using the last name of the last token as the primary key, and all the tokens before the last one as the secondary key, with the whitespace between them compared as if it were a single space. It looks like the separators are single spaces, but I can't be sure, because the file has about 88 million lines in 9GB. The system has 16GB RAM and 16GB swap space, and I can reboot before running this sort.

The sort command does not appear to be able to do this, so I am thinking of doing it in Python. What I envision is first reading all lines of the file into a giant list. A sort key function would do all that funny parsing and comparison, optimized to skip parsing the secondary keys when the primary keys are not equal. Also, I need to do the comparison in a case-insensitive way, but that should be easy enough. Finally, the sorted list would be written out. Does this sound fun? Do I need another bottle of whiskey?
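A minimal sketch of such a key function, assuming the key rules above (note that a tuple key like this is computed eagerly for every line, so unlike a custom comparison via functools.cmp_to_key it cannot skip the secondary key when the primaries already differ):

import operator  # not required; shown only to note there is no lazy-key helper here

def sort_key(line):
    # Split on any run of whitespace, so multiple spaces compare
    # the same as a single space.
    tokens = line.split()
    # Primary key: the last slash-separated name of the last token,
    # casefolded for case-insensitive comparison.
    primary = tokens[-1].split("/")[-1].casefold()
    # Secondary key: all tokens before the last one, re-joined
    # with single spaces.
    secondary = " ".join(tokens[:-1]).casefold()
    return (primary, secondary)

# lines = open("big.txt").readlines()   # file name is an assumption
# lines.sort(key=sort_key)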
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
I suggest a three-step procedure:

Step 1: A Python program reads the big.txt file line by line and creates a new big file bigggg.txt by replacing each line
Output:
foo/bar/baz spam/eggs ham/bacon
with
Output:
bacon foo/bar/baz spam/eggs ham/bacon| . |
Notice that the primary key has been added in front of each line and the whitespace between tokens has been normalized to a single space. At the end of the line, the original runs of whitespace have been appended, separated by dots and enclosed between pipe characters.

Step 2: Run GNU sort on bigggg.txt, producing sbigggg.txt.

Step 3: Read sbigggg.txt line by line and write sbig.txt by applying the reverse operation to each line. A sketch of steps 1 and 3 follows below.
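A minimal sketch of steps 1 and 3, assuming lines have no leading or trailing whitespace and tokens contain no pipe characters (the helper names are just for illustration):

import re

def encode_line(line):
    # Step 1: prepend the primary key, normalize the whitespace
    # between tokens to single spaces, and append the original
    # separator runs (dot-separated, between pipes) so step 3 can
    # rebuild the line exactly.
    line = line.rstrip("\n")
    tokens = line.split()
    seps = re.findall(r"\s+", line)
    primary = tokens[-1].split("/")[-1]
    return f"{primary} {' '.join(tokens)}|{'.'.join(seps)}|\n"

def decode_line(line):
    # Step 3: drop the prepended key and restore the original
    # separator runs.
    body, seps, _ = line.rstrip("\n").rsplit("|", 2)
    tokens = body.split(" ")[1:]
    pieces = [tokens[0]]
    for sep, tok in zip(seps.split("."), tokens[1:]):
        pieces.append(sep + tok)
    return "".join(pieces) + "\n"

For example, encode_line("foo/bar/baz spam/eggs ham/bacon\n") produces "bacon foo/bar/baz spam/eggs ham/bacon| . |\n", and decode_line reverses it.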
#3
I think parsing each line and putting it into an sqlite3 row will be more memory efficient. Then you can easily get the sorted result and write it to a new file.
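A minimal sketch of that idea, reusing the key rules from the first post (the file, table, and index names are assumptions, and sqlite3 needs its own chunk of disk for the database):

import sqlite3

def keys(line):
    # Same primary/secondary keys as described in the first post.
    tokens = line.split()
    return (tokens[-1].split("/")[-1].casefold(),
            " ".join(tokens[:-1]).casefold())

conn = sqlite3.connect("sortwork.db")
conn.execute("CREATE TABLE lines (pri TEXT, sec TEXT, raw TEXT)")

with open("big.txt") as src:
    conn.executemany(
        "INSERT INTO lines VALUES (?, ?, ?)",
        (keys(line) + (line,) for line in src))
conn.commit()

# An index lets ORDER BY stream the rows back in key order.
conn.execute("CREATE INDEX idx ON lines (pri, sec)")

with open("sorted.txt", "w") as dst:
    for (raw,) in conn.execute("SELECT raw FROM lines ORDER BY pri, sec"):
        dst.write(raw)
conn.close()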
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#4
(Oct-14-2018, 02:13 PM)wavic Wrote: I think parsing each line and putting it into an sqlite3 row will be more memory efficient.
The memory consumption in the three-step procedure is completely controlled by the GNU sort command. Steps 1 and 3 process the files line by line, so they won't be memory expensive. I assume GNU sort is optimized to handle very large files.
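For reference, step 2 could be driven from the same Python script. GNU sort's -f option folds case, which supplies the case-insensitive comparison asked for in the first post; -S caps the in-memory buffer and -T redirects the temporary files (the size and path here are assumptions):

import subprocess

# Let GNU sort do the on-disk merge sort of the keyed file.
subprocess.run(
    ["sort", "-f", "-S", "8G", "-T", "/mnt/scratch",
     "-o", "sbigggg.txt", "bigggg.txt"],
    check=True,
)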
#5
I mean all these files end up on the hard drive.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#6
SQLite3 uses the hard drive too!
#7
I know.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#8
The problem with using a DBMS to store super-large files is the massive increase in processing time required.
For a few million lines this may not be an issue, but for billions or even trillions of lines, watch out!
#9
There are different kinds of DBs.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#10
I forgot to mention that I barely have enough space to store the sorted result, so that three-step method will not leave me with enough space. Maybe it's time to buy another 2TB USB drive.
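Before committing to either approach, a quick standard-library check of the remaining room may help (the path is an assumption):

import shutil

# The three-step method needs room for bigggg.txt, sbigggg.txt, and
# sbig.txt, each roughly the size of the 9GB input, plus sort's temp files.
free = shutil.disk_usage("/home").free
print("free space: %.1f GB" % (free / 1e9))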
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.

