Python Forum
Removing handles and links from string
#1
Hello! My "Python quest" continues and I am seeking your help with the following task.

I have a data frame of 3199 rows with a string column named "text", e.g.:

df['text'].iloc[2001]
Output:
I miss biz!#ThingsYouDontWannaHearBeforeAKiss "Hope you like bacon and corn"@martensmi Love it. Jack and I r watching as wellYour compliments mean more than anyone else\'s.@DaveAbberger @markholston prince NEEDED to go... .196 in the postseason with like 3RBI\'s?!?! So glad he\'s gone... Now we need Inge back :)On this damn damnI\'m at Jimmys\' Lounge (Hazen, ND) http://t.co/ciYHl8J3x2I don\'t wanna fall in love to think about the future. I wanna fall in love and have fun because I\'m young and I deserve fun ya knowMy 2nd home? (@ Krause\'s SuperValu) http://t.co/4atauDdqfqVery blessed to say I have my first varsity start under my belt. Thanks to everyone who has supported me throughout the years.Pool pilates! (@ Hazen Swimming Pool) [pic]: http://t.co/zY9DIU6HHo"There are two tragedies in life, one is to lose your hearts desire, the other is to gain it." -George Bernard ShawI jus want this week to be over....@JeffGordonWeb WINS AT @MartinsvilleSwy YUSSS!!!!! #NASCAR
What code should I use in order to delete (1) Twitter handles -- i.e., words that start with "@", and (2) links that start with "http://"? 

I tried split and regex, but could not get either to work. Thank you in advance for your help.
#2
I think you should use a regex that matches both links and @username handles, and remove whatever it matches.

I tried it with a simple pattern:
pattern = r'(http://[^"\s]+)|(@\w+)'
where http://[^"\s]+ matches a string that starts with http:// and ends before the next whitespace or " character, and @\w+ matches @ followed by one or more "word" characters (letters, digits, underscore).
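As a quick check outside pandas, here is a minimal sketch that applies the same pattern with the standard re module to a fragment of the text from the question. The helper name strip_handles_and_links is made up for this example. The pattern only covers links that start with http://, as asked.

```python
import re

# Same pattern as above: either an http:// link or an @handle.
pattern = r'(http://[^"\s]+)|(@\w+)'

def strip_handles_and_links(text):
    """Hypothetical helper: delete every match of the pattern from text."""
    return re.sub(pattern, "", text)

# Fragment taken from the sample row in the question.
sample = ('Pool pilates! (@ Hazen Swimming Pool) [pic]: '
          'http://t.co/zY9DIU6HHo @JeffGordonWeb WINS')

cleaned = strip_handles_and_links(sample)
print(cleaned)
```

Note that "(@ Hazen Swimming Pool)" is left alone, because @\w+ needs at least one word character directly after the @.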

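You can replace the regex matches in the dataframe column "text" with
df.text.str.replace(pattern, "", regex=True)
(regex=True has to be passed explicitly in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.) When I tried it on your text, the result was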
Output:
I miss biz!#ThingsYouDontWannaHearBeforeAKiss "Hope you like bacon and corn" Love it. Jack and I r watching as wellYour compliments mean more than anyone else\'s.  prince NEEDED to go... .196 in the postseason with like 3RBI\'s?!?! So glad he\'s gone... Now we need Inge back :)On this damn damnI\'m at Jimmys\' Lounge (Hazen, ND)  don\'t wanna fall in love to think about the future. I wanna fall in love and have fun because I\'m young and I deserve fun ya knowMy 2nd home? (@ Krause\'s SuperValu)  blessed to say I have my first varsity start under my belt. Thanks to everyone who has supported me throughout the years.Pool pilates! (@ Hazen Swimming Pool) [pic]: "There are two tragedies in life, one is to lose your hearts desire, the other is to gain it." -George Bernard ShawI jus want this week to be over.... WINS AT  YUSSS!!!!! #NASCAR'
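Keep in mind that str.replace returns a new Series and does not change the data frame in place. A minimal sketch of writing the cleaned text back into the column, assuming pandas is installed; the two-row data frame is a made-up stand-in for the real 3199-row one:

```python
import pandas as pd

# Hypothetical stand-in for the data frame from the question.
df = pd.DataFrame({"text": [
    "Love it @martensmi http://t.co/abc123 see you",
    "No handles or links here",
]})

pattern = r'(http://[^"\s]+)|(@\w+)'

# str.replace on one column returns a Series; assign it back to the column
# so that df stays a DataFrame with its column heading.
df["text"] = df["text"].str.replace(pattern, "", regex=True)

print(type(df["text"]).__name__)  # a single column is a Series
print(type(df).__name__)          # the frame itself is still a DataFrame
print(df)
```

Printing the result of df.text.str.replace(...) on its own shows a Series (index plus values, no column heading), while printing df shows the table with the "text" heading.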
The problem with such a simple regex for Twitter usernames is that it also catches parts of email addresses, so it would be better to replace @\w+ with a smarter pattern that preserves emails, such as
(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9]+)
which I found on Stack Overflow.
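To illustrate the email problem, here is a small sketch using only the standard re module. It uses a simplified negative lookbehind, (?<![\w.-]), instead of the Stack Overflow pattern; that substitution and the sample string are assumptions made for this example.

```python
import re

# Made-up sample containing both an email address and a Twitter handle.
text = "mail bob@example.com or ping @alice"

# Simple pattern: any @ followed by word characters.
simple = r'@\w+'

# Simplified alternative (an assumption, not the Stack Overflow pattern):
# only match @ when it is NOT preceded by a word character, dot or hyphen,
# so the @ inside an email address is skipped.
safer = r'(?<![\w.-])@\w+'

simple_result = re.sub(simple, "", text)
safer_result = re.sub(safer, "", text)

print(repr(simple_result))  # 'mail bob.com or ping '
print(repr(safer_result))   # 'mail bob@example.com or ping '
```

The simple pattern cuts "@example" out of the address, while the lookbehind version removes only the handle.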
#3
Thank you for the response, zivoni. Your suggestion worked fine (both with and without the "adjustment" for emails). I have one follow-up question (likely a noob one): why is the resulting output not a table (no columns with headings)? For example:

Output:
0       he can't bc of meI'm at McQueen Village Apts ...
1       Just finished. Such a good book! Suggestions f...
2       We've been through the worse made it through t...
3       Something's ion Understand . have enough for m...
4       You didn't eat breakfast? Good you're grounded...
5       We going to see Carrie on the 19thI'm about to...