Python Forum

df2 contains circa 1M rows and 9 columns. df2 starts life as a copy of df1, and only has changes made to values, with no rows or columns added or deleted.

What’s the most efficient way to create df3, containing only the rows of df2 whose values have changed compared with the same row in df1?

import pandas as pd

def compare_large_dataframes(df1, df2):

    if df1.shape != df2.shape:
        raise ValueError("DataFrames must have the same number of rows and columns")

    # Outer-merge on all shared columns and keep the rows found only in df2.
    merged_df = pd.merge(df1, df2, how='outer', indicator=True).query('_merge == "right_only"').drop('_merge', axis=1)

    return merged_df

df1 and df2 have the same shape. I was using the above function, but it’s now throwing an error I’m having a hard time getting to the bottom of:

df3 = compare_large_dataframes(df1, df2)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/ztm.py", line 4586, in compare_large_dataframes
    merged_df = pd.merge(df1, df2, how='outer', indicator=True).query('_merge == "right_only"').drop('_merge', axis=1)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 184, in merge
    return op.get_result(copy=copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 886, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
                                              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1151, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1125, in _get_join_indexers
    return get_join_indexers(
           ^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1740, in get_join_indexers
    zipped = zip(*mapped)
             ^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1737, in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort)
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 2570, in _factorize_keys
    llab, rlab = _sort_labels(uniques, llab, rlab)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 2631, in _sort_labels
    _, new_labels = algos.safe_sort(uniques, labels, use_na_sentinel=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/algorithms.py", line 1543, in safe_sort
    raise ValueError("values should be unique if codes is not None")
ValueError: values should be unique if codes is not None

Please provide an example and what you expect for output. DataFrame.compare will give you the difference between two dataframes, but the output may not be the same as you got using merge.

https://pandas.pydata.org/docs/reference...mpare.html
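For illustration, here is a minimal sketch of a row-by-row comparison. It assumes df1 and df2 share the same index and column labels, which should hold if df2 began as a copy of df1. The sample data and the helper name changed_rows are made up for this example and are not from the original post. Note that this compares each row with the row at the same label, whereas the merge above reports df2 rows that match no row anywhere in df1, so the two results can differ.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real df1/df2.
df1 = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "price": [10.0, np.nan, 30.0, 40.0],
        "tag": ["a", "b", "c", "d"],
    }
)
df2 = df1.copy()
df2.loc[1, "tag"] = "B"     # changed value
df2.loc[3, "price"] = 41.5  # changed value


def changed_rows(df1, df2):
    """Return the rows of df2 that differ from the same-labelled row in df1."""
    if df1.shape != df2.shape:
        raise ValueError("DataFrames must have the same number of rows and columns")
    # ne() compares element-wise. NaN != NaN evaluates to True, so cells
    # that are NaN in both frames are masked out explicitly.
    differs = df1.ne(df2) & ~(df1.isna() & df2.isna())
    return df2[differs.any(axis=1)]


df3 = changed_rows(df1, df2)

# DataFrame.compare reports only the differing cells, with "self"/"other"
# sub-columns, so its layout is not the same as df3.
diff = df1.compare(df2)

print(df3)
print(diff)
```

If the full changed rows are wanted rather than a cell-level report, df2.loc[diff.index] should select the same rows as df3 in this example.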

hobbycoder

deanhystad