Why can't I merge pandas dataframes - learnpython2018 - Sep-23-2018
I'm just trying to learn Python and started with the IMDb database files (the headers and the data files can be seen here: https://www.imdb.com/interfaces/).
When I try to merge the two data frames, I keep getting a "key not found" error:
ratings = pd.read_csv('title.ratings.tsv', sep = '\t').drop_duplicates(subset = 'tconst', keep = 'first')
titles = pd.read_csv('title.akas.tsv', sep = '\t').drop_duplicates(subset = 'titleId', keep = 'first')
titles.merge(titles, ratings, left_on="titleId", right_on="tconst")
I can't figure out what I'm doing wrong. Any guidance would be appreciated.
RE: Why can't I merge pandas dataframes - ichabod801 - Sep-23-2018
I don't see anything obviously wrong. The exact error you are getting would be helpful. I would also print the two datasets after you pull them but before you try the merge to make sure they are what you expect. Are you sure the error is on the merge, and not on one of the drop_duplicates?
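The check suggested above could look something like the following minimal sketch. It is written for Python 3, the two frames are small stand-ins rather than the real IMDb files, and `missing_columns` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# Small stand-in frames (assumption: the real data comes from the IMDb .tsv files).
titles = pd.DataFrame({
    "titleId": ["tt0000001", "tt0000002"],
    "title": ["Carmencita", "Le clown et ses chiens"],
})
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "averageRating": [5.8, 6.4],
})


def missing_columns(frame, required):
    """Return the names in `required` that are not columns of `frame`."""
    return [name for name in required if name not in frame.columns]


# Look at the data and confirm the join keys exist before merging.
print(titles.head())
print(ratings.head())
print("missing in titles:", missing_columns(titles, ["titleId"]))
print("missing in ratings:", missing_columns(ratings, ["tconst"]))
```

If either list is non-empty, the problem is in the loaded data rather than in the merge call itself.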
RE: Why can't I merge pandas dataframes - learnpython2018 - Sep-23-2018
Thanks. Here is the updated code to display head:
ratings = pd.read_csv('title.ratings.tsv', sep = '\t').drop_duplicates(subset = 'tconst', keep = 'first')
titles = pd.read_csv('title.akas.tsv', sep = '\t').drop_duplicates(subset = 'titleId', keep = 'first')
print titles.head()
print ratings.head()
titles.merge(titles, ratings, left_on="titleId", right_on="tconst")
The error:
Error:
  File "mihika1.py", line 8, in <module>
    titles.merge(titles, ratings, left_on="titleId", right_on="tconst")
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5370, in merge
    copy=copy, indicator=indicator, validate=validate)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.py", line 57, in merge
    validate=validate)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.py", line 565, in __init__
    self.join_names) = self._get_merge_keys()
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.py", line 824, in _get_merge_keys
    right_keys.append(right[rk]._values)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2139, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2146, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 1842, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3843, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexes/base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'tconst'
The output from the head
Output:
sys:1: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
      titleId  ordering                      title region language  \
 0  tt0000001         1  Carmencita - spanyol tánc     HU       \N
 4  tt0000002         1     Le clown et ses chiens     \N       \N
10  tt0000003         1           Sarmanul Pierrot     RO       \N
16  tt0000004         1                Un bon bock     \N       \N
22  tt0000005         1        Blacksmithing Scene     US       \N

          types  attributes  isOriginalTitle
 0  imdbDisplay          \N                0
 4     original          \N                1
10  imdbDisplay          \N                0
16     original          \N                1
22  alternative          \N                0

      tconst  averageRating  numVotes
0  tt0000001            5.8      1412
1  tt0000002            6.4       167
2  tt0000003            6.6      1006
3  tt0000004            6.4       100
4  tt0000005            6.2      1708
What a stupid mistake
titles.merge(titles, ratings, left_on="titleId", right_on="tconst")
should have been:
pd.merge(titles, ratings, left_on="titleId", right_on="tconst")
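For anyone hitting the same error, here is a likely explanation, going by the `DataFrame.merge(right, how=..., ...)` signature: in the method form the left frame is the object itself, so `titles.merge(titles, ratings, ...)` passes `titles` as `right` and `ratings` as `how`. pandas then looks for `tconst` in `titles`, which matches the `right[rk]` line in the traceback. The sketch below is written for Python 3 and uses small stand-in frames (a few rows copied from the `head()` output) instead of the real files. The exact exception raised by the buggy call may differ between pandas versions, so it is caught broadly.

```python
import pandas as pd

# Stand-in frames built from a few rows of the head() output (not the full files).
titles = pd.DataFrame({
    "titleId": ["tt0000001", "tt0000002", "tt0000003"],
    "title": ["Carmencita - spanyol tánc", "Le clown et ses chiens", "Sarmanul Pierrot"],
})
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "averageRating": [5.8, 6.4, 6.6],
    "numVotes": [1412, 167, 1006],
})

# Buggy form: `titles` lands in `right` and `ratings` lands in `how`.
try:
    titles.merge(titles, ratings, left_on="titleId", right_on="tconst")
except Exception as exc:  # exception type varies by pandas version
    print("buggy call failed with", type(exc).__name__)

# Function form: both frames are passed explicitly.
merged_fn = pd.merge(titles, ratings, left_on="titleId", right_on="tconst")

# Method form: the left frame is `titles` itself, so only `ratings` is passed.
merged_method = titles.merge(ratings, left_on="titleId", right_on="tconst")

print(merged_fn)
```

Both correct forms give the same result. Because the key columns have different names, the merged frame keeps both `titleId` and `tconst`.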