Automating the code using os.walk
#11
In your posted code you are converting and concatenating JSON files into one big CSV file. You repeatedly read one line from one of your JSON files, extract the four values you are interested in and store them in your elements lists, so they can be converted to a dataframe and exported as a CSV at the end.

Instead of adding tweet values to elements, you could write each tweet directly to the CSV file - you won't "accumulate" all tweets in memory; you just read one line with a tweet, parse that single JSON object and write it into the file, so memory requirements stay low.

Your code could be something like this (untested; only a csv writer and the writing were added):
import csv, json, os

elements_keys = ['created_at', 'text', 'lang', 'geo']
with open('outfile.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(elements_keys)   # header

    for dirpath, subdirs, files in os.walk('/home/Dir'):
        for name in files:
            if name.endswith('.json'):
                # os.walk yields bare file names, so join them with the directory path
                with open(os.path.join(dirpath, name), 'r') as input_file:
                    for line in input_file:
                        try:
                            tweet = json.loads(line)
                            row = [tweet[key] for key in elements_keys]
                            writer.writerow(row)     # write the tweet into the file
                        except (ValueError, KeyError):
                            # skip lines that are not valid JSON or lack one of the keys
                            continue
I am not sure about the performance of csv.writer when writing line by line; maybe it would be better to accumulate rows in an auxiliary list and write them all at once every ~10000 rows with csv.writerows(). But for a start I would try it one by one (and on a smaller number of files).
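A rough sketch of what that batching could look like (untested; the batch size is arbitrary):
import csv, json, os

elements_keys = ['created_at', 'text', 'lang', 'geo']
BATCH_SIZE = 10000                       # arbitrary; tune as needed

with open('outfile.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(elements_keys)       # header
    batch = []
    for dirpath, subdirs, files in os.walk('/home/Dir'):
        for name in files:
            if name.endswith('.json'):
                with open(os.path.join(dirpath, name), 'r') as input_file:
                    for line in input_file:
                        try:
                            tweet = json.loads(line)
                            batch.append([tweet[key] for key in elements_keys])
                        except (ValueError, KeyError):
                            continue
                        if len(batch) >= BATCH_SIZE:
                            writer.writerows(batch)   # flush the accumulated rows
                            batch = []
    if batch:
        writer.writerows(batch)          # write whatever is left over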
Reply
#12
(Apr-03-2017, 05:46 PM)zivoni Wrote: I am not sure about the performance of csv.writer when writing line by line; maybe it would be better to accumulate rows in an auxiliary list and write them all at once every ~10000 rows with csv.writerows(). But for a start I would try it one by one (and on a smaller number of files).

So, I ran a test from my hard drive using similar data of about 50-100MB -- everything went well (!), i.e., a CSV with the relevant data extracted was saved.

Then I ran tests on EC2 with the data divided into 10 chunks of different sizes (6-20GB). These data had been downloaded to EC2 from the original source and decompressed from BZ2 to JSON. Processing of these chunks went relatively fast without any errors; however, the results are CSV files of 26 bytes each -- a table with headings only.

Next, I downloaded one of the 10 chunks (6GB, ~1200 JSONs) from EC2 and ran it on my own computer -- relatively fast processing, no errors, but still only a CSV with headings. It feels like the code doesn't recognize the files in the folder as JSONs. Very weird...

Being in doubt, I took a similarly big chunk of data that I had stored on my computer, decompressed a small piece of it (12GB), and ran the test again using zivoni's code -- the processing went flawlessly, resulting in a 522MB CSV.

Question: is there anything that an EC2 Ubuntu instance could be doing to the JSONs decompressed from BZ2? I am clueless.

Update: I created a new EC2 Linux instance and repeated my routine -- download the TAR from the source, decompress it, then parse the JSONs -- with the same result: the parser "doesn't see" the decompressed JSONs (i.e., the CSV table has headings only). At the same time, a file downloaded and decompressed the same way on my own computer parses with no issues.
Reply
#13
Then compare the results of uncompressing on the EC2 instance and on your own computer. Download one uncompressed file from EC2 and compare it with the corresponding file uncompressed on your computer (diff or cmp on the command line should be enough). You can also try to parse the downloaded JSON on your own computer.

After that you should know whether you need to look into the uncompressing script or the parsing one.
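If you prefer to stay in Python for the comparison, something like this should be enough (a minimal sketch; the file names are placeholders):
import filecmp

# shallow=False forces an actual content comparison of the two copies
same = filecmp.cmp('tweets_local.json', 'tweets_from_ec2.json', shallow=False)
print('identical' if same else 'files differ')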
Reply
#14
Thank you for your comment, zivoni. I did some additional investigation and figured out that the TAR archives that I download on my computer (say, using Chrome) and that I download on EC2 (see code below) are different. Therefore, I suspect something is wrong with the way I download a TAR file (23-45GB) on EC2. Any suggestions on fast and reliable "download code" are appreciated :)

# Downloader that I used to download TARs to EC2 (found on StackOverflow)
from __future__ import ( division, absolute_import, print_function, unicode_literals )

import sys, os, tempfile, logging

if sys.version_info >= (3,):
    import urllib.request as urllib2
    import urllib.parse as urlparse
else:
    import urllib2
    import urlparse

def download_file(url, desc=None):
    u = urllib2.urlopen(url)

    scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
    filename = os.path.basename(path)
    if not filename:
        filename = 'downloaded.file'
    if desc:
        filename = os.path.join(desc, filename)

    with open(filename, 'wb') as f:
        meta = u.info()
        meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
        meta_length = meta_func("Content-Length")
        file_size = None
        if meta_length:
            file_size = int(meta_length[0])
        print("Downloading: {0} Bytes: {1}".format(url, file_size))

        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break

            file_size_dl += len(buffer)
            f.write(buffer)

            status = "{0:16}".format(file_size_dl)
            if file_size:
                status += "   [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
            status += chr(13)
            print(status, end="")
        print()

    return filename

url = "https://archive.org/download/archiveteam-twitter-stream-2012-01/archiveteam-twitter-2012-01.tar"
filename = download_file(url)
print(filename)
Reply
#15
I would just try
Output:
wget https://archive.org/download/archiveteam-twitter-stream-2012-01/archiveteam-twitter-2012-01.tar
on your remote machine. wget is usually quite reliable.
Reply
#16
(Apr-04-2017, 06:40 PM)zivoni Wrote: I would just try
Output:
wget https://archive.org/download/archiveteam-twitter-stream-2012-01/archiveteam-twitter-2012-01.tar
on your remote machine. wget is usually quite reliable.

Thank you for the suggestion. I am downloading the file this way now and will test it. The speed, however, is relatively slow -- 1-2MB/s on a 10 Gigabit channel.
Reply
#17
(Apr-04-2017, 06:48 PM)kiton Wrote:
(Apr-04-2017, 06:40 PM)zivoni Wrote: I would just try
Output:
wget https://archive.org/download/archiveteam-twitter-stream-2012-01/archiveteam-twitter-2012-01.tar
on your remote machine. wget is usually quite reliable.

Thank you for the suggestion. I am downloading the file this way now and will test it. The speed, however, is relatively slow -- 1-2MB/s on a 10 Gigabit channel.

Your Python code isn't that fast either. There are better protocols than HTTP to transfer large files. FTP is already somewhat faster but it will not take advantage of all the bandwidth.

As a minimum, compute the MD5 on both sides of the transfer and compare them.
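A minimal sketch of such a check, assuming you run the same script on both machines and compare the printed digests (the file name is a placeholder):
import hashlib

def md5sum(path, block_size=8 * 1024 * 1024):
    # Read the file in chunks so even a 45GB TAR never has to fit in memory.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

print(md5sum('archiveteam-twitter-2012-01.tar'))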
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
Reply
#18
Your download speed likely depends mainly on the archive.org servers' resources. You could try a download accelerator like aria2 or axel (they can download a file over multiple connections at the same time), but there is no guarantee that it will be faster.

Archive.org provides torrents too, but for most files they are probably the only one who seeds them.
Reply
#19
This drives me nuts... So, I used wget to download a TAR to EC2, then decompressed the TAR, then decompressed the enclosed BZ2s, then took a small portion of the unpacked JSONs and ran the parser -- an empty table with headings again :( Exactly the same routine works fine when executed on my own computer. Totally frustrated...

When I open any JSON extracted on my computer, it always starts with:
"{"created_at":"Fri Nov 01 06:00:00 +0000 2013","id":396154638434435072,"id_str":"396154638434435072",..." -- which seems correct, as a tweet's first key is "created_at".

However, the same JSON extracted on EC2 always starts with:
"{"retweet_count":0,"in_reply_to_screen_name":null,"text":"Photo: http:\/\/t.co\/JmDhpX8V","in_reply_to_status_id_str":null,..." -- which is obviously incorrect, as a tweet's first key shouldn't be "retweet_count".

Any feedback on this would be greatly appreciated.
Reply
#20
As Ofnuts suggested, you should check the downloaded file. There is an XML file with meta information on the archive.org page that includes a SHA1, so you can compare against it (and you can also compare it with the file downloaded to your PC).

I am not sure that a tweet export must start with "created_at" (it could depend on the software used for the export; in Python, JSON objects are dicts and they don't keep order (except in 3.6)). Actually I don't think that the files on EC2 are corrupt (bzip2 should report it/crash when you try to extract a corrupted archive); they are likely just a different version...

If the keys in that file have different names, or some are missing (I don't know if "lang" was used 5 years ago), then parsing it would raise an error in the innermost loop and nothing would be written.
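A rough way to check that hypothesis (an untested sketch, not part of the original code; the sample path is a placeholder) is to parse one of the EC2 files without the bare except and report what is actually missing:
import json

elements_keys = ['created_at', 'text', 'lang', 'geo']

with open('sample_from_ec2.json', 'r') as input_file:
    for line_no, line in enumerate(input_file, 1):
        try:
            tweet = json.loads(line)
        except ValueError as exc:
            print('line', line_no, 'is not valid JSON:', exc)
            continue
        # report which of the expected keys are absent from this tweet
        missing = [key for key in elements_keys if key not in tweet]
        if missing:
            print('line', line_no, 'is missing keys:', missing)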
Reply