Python Forum
numpy dtype anomaly
#1
I'm trying to load two arrays from two columns of a delimited file, using numpy's loadtxt() function, like so:

#!/usr/bin/python3

import sys
import numpy as np
import os.path as op
from datetime import datetime, date, time
from io import StringIO

sample_data = StringIO("AAPL,28-01-2011, ,344.17,344.4,333.53,336.1,21144800\n\
AAPL,31-01-2011, ,335.8,340.04,334.3,339.32,13473000\n\
AAPL,01-02-2011, ,341.3,345.65,340.98,345.03,15236800\n\
AAPL,02-02-2011, ,344.45,345.25,343.55,344.32,9242600\n\
AAPL,03-02-2011, ,343.8,344.24,338.55,343.44,14064100\n\
AAPL,04-02-2011, ,343.61,346.7,343.51,346.5,11494200")

def usage():
    print("usage: {} {}".format(op.basename(sys.argv[0]), 'filename'))

def get_weekday(date_str):
    return datetime.strptime(date_str, "%d-%m-%Y").date().weekday()

def load_arrays(data_file, *col_tuple):
    a1 = a2 = None
    rec_type = np.dtype([('stock_code', '|S4'), ('cob_date', '|S10'), ('filler', '|S1'), 
                        ('low_price', 'f4'), ('high_price', 'f4'), ('close_price', 'f4'), 
                        ('valuation', 'f4'), ('volume', 'uint') ])

    try:
        a1, a2 = np.loadtxt(data_file, dtype=rec_type, usecols=col_tuple, delimiter=',', unpack=True)
        # a1, a2 = np.loadtxt(data_file, usecols=col_tuple, delimiter=',', unpack=True)
    except IOError as e:
        usage() # failed to open file
    except Exception as e: print(e)

    return a1, a2

try:
    # data_file = sys.argv[1]
    data_file = sample_data
    c, v = load_arrays(data_file, 5, 6)
except IndexError:
    usage()

print("Closing price array:\n{}".format(c))
print("\nValuation array:\n{}".format(v))
When I load the arrays without defining any data types, the load is successful,
i.e. using
a1, a2 = np.loadtxt(data_file, usecols=col_tuple, delimiter=',', unpack=True)
but when I attempt to apply data types, by specifying
a1, a2 = np.loadtxt(data_file, dtype=rec_type, usecols=col_tuple, delimiter=',', unpack=True)
I get the following output
list index out of range
Closing price array:
None

Valuation array:
None
Can anybody suggest why the difference or what I am specifying incorrectly as part of the data type specification?
Reply
#2
Hmm... If the list index is out of range, that suggests to me that the number of fields in your rec_type differs from the number of columns actually being read. I would guess that rec_type has more fields than the columns you selected, so loadtxt cannot find a column to match every field in rec_type.
Reply
#3
I've reduced the number of columns to 4, and still the error occurs if usecols is specified. If not, it succeeds in loading each column into an array. I've taken out the missing column, which is column 2 in the previous post.
I've also included a decode('ascii') on the byte string for the date, which is the 2nd column.

So without "usecols", it works fine, as follows:
import sys
import numpy as np
import os.path as op
from datetime import datetime, date, time
from io import StringIO

sample_data = StringIO(
"AAPL,28-01-2011,344.17,344.4\n\
AAPL,31-01-2011,335.8,340.04\n\
AAPL,01-02-2011,341.3,345.65\n\
AAPL,02-02-2011,344.45,345.25\n\
AAPL,03-02-2011,343.8,344.24\n\
AAPL,04-02-2011,343.61,346.7")

def usage():
    print("usage: {} {}".format(op.basename(sys.argv[0]), 'filename'))

def get_weekday(date_str):
    return datetime.strptime(date_str.decode('ascii'), "%d-%m-%Y").date().weekday()

def load_arrays(data_file, *col_tuple):
    a1 = a2 = a3 = a4 = None

  #  rec_type = np.dtype([('stock_code', 'S4'), ('cob_date', 'S10'), ('close_price', 'f4')])
    try:
        a1, a2, a3, a4 = np.loadtxt(data_file, 
                                dtype={'names': ('stock_code','cob_date','high_price','low_price'),
                                       'formats': ('S4', 'S10', 'f4', 'f4')}, 
                                converters={1: get_weekday}, delimiter=',',
                                unpack=True)

    except IOError as e:
        usage() # failed to open file
    except Exception as e: print(e)

    return a1, a2, a3, a4

try:
    data_file = sample_data
    s, d, h, l = load_arrays(data_file)
except IndexError:
    usage()

print("Stock code array:\n{}".format(s))
print("\nClose of Business date array:\n{}".format(d))
print("\nHigh price array:\n{}".format(h))
print("\nLow price array:\n{}".format(l))
But when I want to load only, say, columns 0 and 2, I get
list index out of range
The code looks like this:
def load_arrays(data_file, *col_tuple):
    a1 = a2 = None

    try:
        a1, a2 = np.loadtxt(data_file, 
                                dtype={'names': ('stock_code','cob_date','high_price','low_price'),
                                       'formats': ('S4', 'S10', 'f4', 'f4')}, 
                                converters={1: get_weekday}, delimiter=',',
                                usecols=(0,2), unpack=True)

    except IOError as e:
        usage() # failed to open file
    except Exception as e: print(e)

    return a1, a2

try:
    data_file = sample_data
    s, h = load_arrays(data_file)
except IndexError:
    usage()
The number of columns matches what is specified in dtype.
Reply
#4
According to the documentation, the usecols parameter does this:

Quote:usecols : int or sequence, optional

Which columns to read, with 0 being the first. For example, usecols = (1,4,5) will extract the 2nd, 5th and 6th columns. The default, None, results in all columns being read.

So, you're telling it to use column 0 and column 2 only but your dtype has four columns listed. Have you tried usecols with the same number of columns as the dtype?
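As a minimal sketch of what that looks like, assuming the four-column sample data from post #3: the dtype below describes only the two columns picked by usecols. The name sub_type is made up for illustration, and 'U4' (a str field) is swapped in for 'S4' so the result prints as plain text.

```python
import numpy as np
from io import StringIO

sample_data = StringIO(
    "AAPL,28-01-2011,344.17,344.4\n"
    "AAPL,31-01-2011,335.8,340.04\n"
)

# One dtype field per entry in usecols: column 0 and column 2 only.
# 'sub_type' is a hypothetical name, not part of the original code.
sub_type = np.dtype([('stock_code', 'U4'), ('high_price', 'f4')])

# With a structured dtype and unpack=True, loadtxt returns one array per field.
codes, highs = np.loadtxt(sample_data, dtype=sub_type, delimiter=',',
                          usecols=(0, 2), unpack=True)

print(codes)
print(highs)
```

The point is just that len(usecols) and the number of dtype fields agree; the field names themselves are arbitrary.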
Reply
#5
Thanks, it now works.
So basically one has to specify the record column names and data types each time one extracts an arbitrary set of columns.
Hmm, a bit clunky.
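One possible way to make this less clunky, sketched under the assumption that the full record type is defined once: derive the reduced dtype from it using the same column indices that go to usecols. The helper load_columns and the name full_type are hypothetical, not part of numpy, and 'U' string fields stand in for the 'S' fields used above.

```python
import numpy as np
from io import StringIO

# Full record layout, defined once (hypothetical name).
full_type = np.dtype([('stock_code', 'U4'), ('cob_date', 'U10'),
                      ('high_price', 'f4'), ('low_price', 'f4')])

def load_columns(source, record_type, cols):
    """Load only the columns in cols, reusing names and formats from record_type."""
    names = [record_type.names[i] for i in cols]
    # record_type.fields maps each name to (dtype, offset); keep only the dtype.
    sub_type = np.dtype([(name, record_type.fields[name][0]) for name in names])
    return np.loadtxt(source, dtype=sub_type, delimiter=',',
                      usecols=cols, unpack=True)

sample_data = StringIO(
    "AAPL,28-01-2011,344.17,344.4\n"
    "AAPL,31-01-2011,335.8,340.04\n"
)

codes, lows = load_columns(sample_data, full_type, (0, 3))
print(codes)
print(lows)
```

This keeps the column names and formats in one place, so choosing a different set of columns only means changing the index tuple.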
Reply

