numpy dtype anomaly - Printable Version

numpy dtype anomaly - bluefrog - Nov-05-2018

I'm attempting to load 2 arrays from 2 columns read from a file. The file is comma-delimited and I'm using numpy's loadtxt() function to load the arrays, like so:

#!/usr/bin/python3

import sys
import numpy as np
import os.path as op
from datetime import datetime, date, time
from io import StringIO

sample_data = StringIO("AAPL,28-01-2011, ,344.17,344.4,333.53,336.1,21144800\n\
AAPL,31-01-2011, ,335.8,340.04,334.3,339.32,13473000\n\
AAPL,01-02-2011, ,341.3,345.65,340.98,345.03,15236800\n\
AAPL,02-02-2011, ,344.45,345.25,343.55,344.32,9242600\n\
AAPL,03-02-2011, ,343.8,344.24,338.55,343.44,14064100\n\
AAPL,04-02-2011, ,343.61,346.7,343.51,346.5,11494200")

def usage():
    print("usage: {} {}".format(op.basename(sys.argv[0]), 'filename'))

def get_weekday(date_str):
    return datetime.strptime(date_str, "%d-%m-%Y").date().weekday()

def load_arrays(data_file, *col_tuple):
    a1 = a2 = None
    rec_type = np.dtype([('stock_code', '|S4'), ('cob_date', '|S10'), ('filler', '|S1'), 
                        ('low_price', 'f4'), ('high_price', 'f4'), ('close_price', 'f4'), 
                        ('valuation', 'f4'), ('volume', 'uint') ])

    try:
        a1, a2 = np.loadtxt(data_file, dtype=rec_type, usecols=col_tuple, delimiter=',', unpack=True)
        # a1, a2 = np.loadtxt(data_file, usecols=col_tuple, delimiter=',', unpack=True)
    except IOError as e:
        usage() # failed to open file
    except Exception as e: print(e)

    return a1, a2

try:
    # data_file = sys.argv[1]
    data_file = sample_data
    c, v = load_arrays(data_file, 5, 6)
except IndexError:
    usage()

print("Closing price array:\n{}".format(c))
print("\nValuation array:\n{}".format(v))

When I load the arrays without any data types defined, the load is successful, i.e. using

a1, a2 = np.loadtxt(data_file, usecols=col_tuple, delimiter=',', unpack=True)

but when I attempt to apply data types, by specifying

a1, a2 = np.loadtxt(data_file, dtype=rec_type, usecols=col_tuple, delimiter=',', unpack=True)

I get the following output:

list index out of range
Closing price array:
None

Valuation array:
None

Can anybody suggest why there is a difference, or what I am specifying incorrectly in the data type specification?

RE: numpy dtype anomaly - stullis - Nov-06-2018

Hmm... If the list index is out of range, that suggests to me that the number of columns being loaded differs from the number of fields in your rec_type. I would guess that rec_type has more fields than there are columns being read, so numpy cannot find a column to match up with each field of the rec_type.
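
A quick way to check that guess is to compare the two counts directly. This is only a sketch: rec_type is copied from the first post, and the name usecols is just an illustrative stand-in for the (5, 6) passed to load_arrays.

```python
import numpy as np

# rec_type as defined in the first post: one field per column in the file
rec_type = np.dtype([('stock_code', '|S4'), ('cob_date', '|S10'), ('filler', '|S1'),
                     ('low_price', 'f4'), ('high_price', 'f4'), ('close_price', 'f4'),
                     ('valuation', 'f4'), ('volume', 'uint')])

# the columns actually requested in the call load_arrays(data_file, 5, 6)
usecols = (5, 6)

n_fields = len(rec_type.names)   # fields described by the dtype
n_cols = len(usecols)            # columns selected for loading
print("dtype fields: {}, columns selected: {}".format(n_fields, n_cols))
```

If the two numbers differ, the dtype and the column selection do not line up.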

RE: numpy dtype anomaly - bluefrog - Nov-07-2018

I've reduced the number of columns to 4, and still the error occurs if usecols is specified. If not, it succeeds in loading each column into an array. I've taken out the missing column, which is column 2 in the previous post.
I've also included a decode('ascii') on the byte string for the date, which is the 2nd column.

So without "usecols", it works fine, as follows:

import sys
import numpy as np
import os.path as op
from datetime import datetime, date, time
from io import StringIO

sample_data = StringIO(
"AAPL,28-01-2011,344.17,344.4\n\
AAPL,31-01-2011,335.8,340.04\n\
AAPL,01-02-2011,341.3,345.65\n\
AAPL,02-02-2011,344.45,345.25\n\
AAPL,03-02-2011,343.8,344.24\n\
AAPL,04-02-2011,343.61,346.7")

def usage():
    print("usage: {} {}".format(op.basename(sys.argv[0], 'filename')))

def get_weekday(date_str):
    return datetime.strptime(date_str.decode('ascii'), "%d-%m-%Y").date().weekday()

def load_arrays(data_file, *col_tuple):
    a1 = a2 = a3 = a4 = None

    # rec_type = np.dtype([('stock_code', 'S4'), ('cob_date', 'S10'), ('close_price', 'f4')])
    try:
        a1, a2, a3, a4 = np.loadtxt(data_file, 
                                dtype={'names': ('stock_code','cob_date','high_price','low_price'),
                                       'formats': ('S4', 'S10', 'f4', 'f4')}, 
                                converters={1: get_weekday}, delimiter=',',
                                unpack=True)

    except IOError as e:
        usage() # failed to open file
    except Exception as e: print(e)

    return a1, a2, a3, a4

try:
    data_file = sample_data
    s, d, h, l = load_arrays(data_file)
except IndexError:
    usage()

print("Stock code array:\n{}".format(s))
print("\nClose of Business date array:\n{}".format(d))
print("\nHigh price array:\n{}".format(h))
print("\nLow price array:\n{}".format(l))

But when I want to load only, say, columns 0 and 2, I get:

list index out of range

The code is:

def load_arrays(data_file, *col_tuple):
    a1 = a2 = None

    try:
        a1, a2 = np.loadtxt(data_file, 
                                dtype={'names': ('stock_code','cob_date','high_price','low_price'),
                                       'formats': ('S4', 'S10', 'f4', 'f4')}, 
                                converters={1: get_weekday}, delimiter=',',
                                usecols=(0,2), unpack=True)

    except IOError as e:
        usage() # failed to open file
    except Exception as e: print(e)

    return a1, a2

try:
    data_file = sample_data
    s, h = load_arrays(data_file)
except IndexError:
    usage()

The number of columns in the file matches the number of fields specified in dtype.

RE: numpy dtype anomaly - stullis - Nov-07-2018

According to the documentation, the usecols parameter does this:

Quote: usecols : int or sequence, optional

Which columns to read, with 0 being the first. For example, usecols = (1,4,5) will extract the 2nd, 5th and 6th columns. The default, None, results in all columns being read.

So, you're telling it to use column 0 and column 2 only but your dtype has four columns listed. Have you tried usecols with the same number of columns as the dtype?
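
For illustration, here is a minimal sketch of that idea using two rows of the sample data. The dtype lists only the two fields that usecols selects. Note that the 'U4' string format is my substitution for the 'S4' in the post (it gives str values rather than byte strings), and the name rec_type is reused here just for convenience.

```python
from io import StringIO
import numpy as np

sample_data = StringIO(
    "AAPL,28-01-2011,344.17,344.4\n"
    "AAPL,31-01-2011,335.8,340.04\n"
)

# Describe only the columns being read: column 0 and column 2.
rec_type = np.dtype([('stock_code', 'U4'), ('high_price', 'f4')])

# Two fields in the dtype, two entries in usecols.
s, h = np.loadtxt(sample_data, dtype=rec_type, delimiter=',',
                  usecols=(0, 2), unpack=True)

print("Stock code array:\n{}".format(s))
print("\nHigh price array:\n{}".format(h))
```

With unpack=True and a structured dtype, loadtxt hands back one array per field, so the two names on the left match the two fields.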

RE: numpy dtype anomaly - bluefrog - Nov-07-2018

Thanks, it now works.
So basically you have to specify the record's column names and data types each time you extract an arbitrary set of columns.
Hmm, a bit clunky.
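
One way to make this less clunky might be to keep a single full record dtype and derive the reduced dtype from the column numbers. The sketch below is not from the thread: subset_dtype, load_columns and FULL_TYPE are hypothetical names, and the string fields use 'U' formats instead of the original 'S' formats as an assumption.

```python
from io import StringIO
import numpy as np

# Full record layout, one field per column in the file (field names from the first post).
FULL_TYPE = np.dtype([('stock_code', 'U4'), ('cob_date', 'U10'), ('filler', 'U1'),
                      ('low_price', 'f4'), ('high_price', 'f4'), ('close_price', 'f4'),
                      ('valuation', 'f4'), ('volume', 'uint')])

def subset_dtype(full_dtype, cols):
    """Build a dtype holding only the fields at the given column positions."""
    return np.dtype([(full_dtype.names[i], full_dtype[i]) for i in cols])

def load_columns(data_file, full_dtype, cols, delimiter=','):
    """Load only `cols`, typed by the matching fields of `full_dtype`."""
    return np.loadtxt(data_file, dtype=subset_dtype(full_dtype, cols),
                      delimiter=delimiter, usecols=cols, unpack=True)

sample_data = StringIO(
    "AAPL,28-01-2011, ,344.17,344.4,333.53,336.1,21144800\n"
    "AAPL,31-01-2011, ,335.8,340.04,334.3,339.32,13473000\n"
)

close_price, valuation = load_columns(sample_data, FULL_TYPE, (5, 6))
print("Closing price array:\n{}".format(close_price))
print("\nValuation array:\n{}".format(valuation))
```

The record layout is then written once, and each call only names the column numbers it wants.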