Oct-28-2024, 03:28 AM
df2 contains circa 1M rows and 9 columns. df2 starts life as a copy of df1, and only has changes made to values, with no rows or columns added or deleted.
What’s the most efficient way of creating df3, containing only the rows of df2 whose values have changed when compared with the same row in df1?
def compare_large_dataframes(df1, df2):
    if df1.shape != df2.shape:
        raise ValueError("DataFrames must have the same number of rows and columns")
    merged_df = pd.merge(df1, df2, how='outer', indicator=True).query('_merge == "right_only"').drop('_merge', axis=1)
    return merged_df

df1 and df2 have the same shape. I was using the above function, but it’s now throwing an error I’m having a hard time getting to the bottom of:
    df3 = compare_large_dataframes(df1, df2)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/ztm.py", line 4586, in compare_large_dataframes
    merged_df = pd.merge(df1, df2, how='outer', indicator=True).query('_merge == "right_only"').drop('_merge', axis=1)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 184, in merge
    return op.get_result(copy=copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 886, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
                                              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1151, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1125, in _get_join_indexers
    return get_join_indexers(
           ^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1740, in get_join_indexers
    zipped = zip(*mapped)
             ^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1737, in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort)
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 2570, in _factorize_keys
    llab, rlab = _sort_labels(uniques, llab, rlab)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 2631, in _sort_labels
    _, new_labels = algos.safe_sort(uniques, labels, use_na_sentinel=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/algorithms.py", line 1543, in safe_sort
    raise ValueError("values should be unique if codes is not None")
ValueError: values should be unique if codes is not None
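
For comparison, here is a minimal sketch of one way to get only the changed rows without merging at all. It is not a fix for the merge error above, just an alternative that relies on the setup described: df2 is a value-edited copy of df1, so the two frames are assumed to share the same index and column labels in the same order. The name changed_rows is hypothetical, and treating cells that are missing in both frames as unchanged is my assumption about the desired behaviour.

```python
import pandas as pd


def changed_rows(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of df2 that differ from the same-labelled rows of df1.

    Assumes df1 and df2 have identical index and column labels (df2 began
    as a copy of df1 and only had values changed).
    """
    if df1.shape != df2.shape:
        raise ValueError("DataFrames must have the same number of rows and columns")

    # Element-wise equality. NaN never equals NaN, so cells that are missing
    # in both frames are explicitly treated as unchanged (an assumption).
    same = df1.eq(df2) | (df1.isna() & df2.isna())

    # Keep the rows of df2 where at least one column differs.
    return df2.loc[~same.all(axis=1)]
```

Usage would be df3 = changed_rows(df1, df2). Because this compares cells by position in already-aligned frames rather than joining on every column, it avoids the key factorisation step that the traceback above passes through.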