Python Forum

Full Version: Pandas's regular expression function result is so strange
I am trying to write a regular expression for df1 (a DataFrame).
I want to remove the rows whose column 3 contains NOPOP, NoPop, or NONPOP.
To make the lookup quick, I set column 3 as the index of the DataFrame
and then filtered it with df.filter and a regex.

import pandas as pd
k=[['a','b','c','NOPOP'],['d','e','f','POP'],['g','h','i','j'],['k','l','m','Pop'],['n','o','p','NoPop_AA'],['q','r','s','NONPOP']]
df_exp=pd.DataFrame(k)
df1=df_exp.set_index([3])
df2=df1.filter(regex='[^NOPOP]|[^NoPop]|[^NONPOP]', axis=0)
Output:
Out[263]:
          0  1  2
3
NOPOP     a  b  c
POP       d  e  f
j         g  h  i
Pop       k  l  m
NoPop_AA  n  o  p
NONPOP    q  r  s
The result still contains the rows for NOPOP, NoPop_AA, and NONPOP. Why were they not removed?

My desired output is shown below.

Output:
     0  1  2
3
POP  d  e  f
j    g  h  i
Pop  k  l  m
You can use str.contains for this.
import pandas as pd

k = [
    ["a", "b", "c", "NOPOP"],
    ["d", "e", "f", "POP"],
    ["g", "h", "i", "j"],
    ["k", "l", "m", "Pop"],
    ["n", "o", "p", "NoPop_AA"],
    ["q", "r", "s", "NONPOP"],
]
df_exp = pd.DataFrame(k)
>>> df_exp = df_exp[~df_exp[3].str.contains('NOPOP|NoPop|NONPOP')]
>>> df1 = df_exp.set_index([3])
>>> df1
     0  1  2
3           
POP  d  e  f
j    g  h  i
Pop  k  l  m
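A note on why the original filter kept every row: inside square brackets, ^ negates a character class, so [^NOPOP] means "any single character other than N, O, or P", not "anything except the word NOPOP". As far as I know, df.filter(regex=...) keeps a label whenever re.search finds a match anywhere in it, and every label here contains at least one character that satisfies one of the three alternatives. The sketch below checks that, then shows one possible way to keep the df.filter approach with a negative lookahead. The lookahead pattern is my own illustration, not something posted in the thread.

```python
import re
import pandas as pd

k = [
    ["a", "b", "c", "NOPOP"],
    ["d", "e", "f", "POP"],
    ["g", "h", "i", "j"],
    ["k", "l", "m", "Pop"],
    ["n", "o", "p", "NoPop_AA"],
    ["q", "r", "s", "NONPOP"],
]
df1 = pd.DataFrame(k).set_index(3)
labels = list(df1.index)

# The original pattern: three negated character classes joined by "|".
# Each alternative matches ONE character that is not in the bracketed set.
original = re.compile(r"[^NOPOP]|[^NoPop]|[^NONPOP]")
matched_by_original = [s for s in labels if original.search(s)]
print(matched_by_original == labels)  # True: every label matches, so nothing is dropped

# Assumed alternative: a negative lookahead anchored at the start of the label.
# It matches only labels that do NOT contain any of the three substrings.
exclude = r"^(?!.*(?:NOPOP|NoPop|NONPOP))"
df2 = df1.filter(regex=exclude, axis=0)
print(list(df2.index))  # ['POP', 'j', 'Pop']
```

The str.contains answer above is easier to read. The lookahead is only worth it if the values must stay in the index.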
Thank you for your quick reply. It works and achieves my goal.

(Jun-05-2020, 11:56 AM)snippsat Wrote: Can use str.contains for this.
Sorry, I have another question.
Does .str.contains support the same pattern syntax as the re module?
For example, '^AA' matches only strings that start with AA.

(Jun-12-2020, 09:35 AM)cools0607 Wrote: I wonder if .str.contains includes specified functions just like re module?
Yes, str.contains accepts regular expression patterns, just like the re module.
Quote: For example: '^AA' expresses only searching words start with AA.
Yes, that would work. Pandas has a lot built in, so there is also str.startswith.
If you wonder whether something works, the best thing is to test it.
import pandas as pd

d = {
    'Quarters' : ['quarter1','quarter2','quarter3','quarter4'],
     'Description': ['AA year', 'BB year', 'CC year', 'AA year'],
     'Revenue': [23.5, 54.6, 5.45, 41.87]
}
df = pd.DataFrame(d)
Test usage:
>>> df[df['Description'].str.contains(r'^AA')]
  Description  Quarters  Revenue
0     AA year  quarter1    23.50
3     AA year  quarter4    41.87
>>> df[df['Description'].str.contains(r'^AA|BB')]
  Description  Quarters  Revenue
0     AA year  quarter1    23.50
1     BB year  quarter2    54.60
3     AA year  quarter4    41.87

>>> # Using str.startswith
>>> df[df['Description'].str.startswith('AA')]
  Description  Quarters  Revenue
0     AA year  quarter1    23.50
3     AA year  quarter4    41.87
>>> df[df['Description'].str.startswith(('AA', 'BB'))]
  Description  Quarters  Revenue
0     AA year  quarter1    23.50
1     BB year  quarter2    54.60
3     AA year  quarter4    41.87 
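One detail about r'^AA|BB' that the sample data hides: in regex alternation the ^ anchor belongs only to the first branch, so the pattern means "starts with AA, or contains BB anywhere". With the data above every BB happens to be at the start, so the result looks the same as startswith. A minimal sketch with a made-up Series (the value "year BB" is my own addition for illustration) shows the difference, along with a non-capturing group that anchors both branches:

```python
import pandas as pd

# Hypothetical data: "year BB" has BB in the middle, not at the start.
s = pd.Series(["AA year", "BB year", "year BB", "CC year"])

# "^AA|BB" reads as (^AA) OR (BB): the anchor covers only the first branch.
loose = s.str.contains(r"^AA|BB")

# "^(?:AA|BB)" anchors both branches; (?:...) groups without capturing.
strict = s.str.contains(r"^(?:AA|BB)")

print(loose.tolist())   # [True, True, True, False]
print(strict.tolist())  # [True, True, False, False]

# str.startswith with a tuple agrees with the anchored pattern.
print(s.str.startswith(("AA", "BB")).tolist())  # [True, True, False, False]
```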
Thank you for your reply. After trying your code, I got it. I think it is convenient for me to use .str.contains(r'^AA').
(Jun-12-2020, 10:14 AM)snippsat Wrote:
(Jun-12-2020, 09:35 AM)cools0607 Wrote: I wonder if .str.contains includes specified functions just like re module?
Yes str.contains can take regular expression patterns as in the re module.
Sorry, I have another question.
I am searching a large amount of data imported from Excel into a list.
I tried two methods:
1. Searching the list with the re module.
2. Converting the list to a DataFrame and then applying the .str.contains() method.
Both work, but the DataFrame method is slower than the list method. Is that expected?
PS: The Python console shows the user warning below:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
  return func(self, *args, **kwargs)
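On the warning: it appears when the pattern passed to str.contains has capturing groups, that is, plain parentheses such as (AA|BB). Writing the group as non-capturing, (?:AA|BB), should avoid it without changing which rows match. On speed: my understanding is that for ordinary object-dtype string columns the pandas string methods still visit the values one at a time, so .str.contains is not guaranteed to beat a list comprehension with a compiled pattern, and building the DataFrame adds its own cost. The sketch below is only a rough way to check both points. The data and patterns are made up, and the timings will differ by machine and pandas version.

```python
import re
import timeit
import warnings
import pandas as pd

# Hypothetical data standing in for the values imported from Excel.
words = ["AA year", "BB year", "CC year", "AA year"] * 1000
series = pd.Series(words)

grouped = r"^(AA|BB)"    # capturing group: the kind of pattern that triggers the warning
plain = r"^(?:AA|BB)"    # non-capturing group: same matches

def group_warnings(pattern):
    """Return the 'match groups' warnings raised by str.contains for a pattern."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        series.str.contains(pattern)
    return [str(w.message) for w in caught if "match groups" in str(w.message)]

print(len(group_warnings(grouped)))  # number of warnings with the capturing group
print(len(group_warnings(plain)))    # number of warnings with the non-capturing group

rx = re.compile(plain)

def with_list():
    # Method 1: plain list plus a compiled re pattern.
    return [bool(rx.search(w)) for w in words]

def with_series():
    # Method 2: pandas string accessor on an existing Series.
    return series.str.contains(plain)

print(with_list() == with_series().tolist())  # both methods flag the same rows
print("list + re:   ", timeit.timeit(with_list, number=20))
print("str.contains:", timeit.timeit(with_series, number=20))
```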