Python Forum
How to use the re library to remove irrelevant words?
#1
Hi, I'm writing a Python script that builds a word cloud. The problem is that some irrelevant words, such as "https", show up in the cloud. I tried importing the re library, but that didn't work. I also tried the stopwords approach, but that didn't solve the problem either. Any help will be appreciated.
#2
Given that you're the one creating the word cloud, why would you include words that are irrelevant? And what do you mean by "I tried to import re library but it doesn't work"? It doesn't work how?
Sig:
>>> import this

The UNIX philosophy: "Do one thing, and do it well."

"The danger of computers becoming like humans is not as great as the danger of humans becoming like computers." :~ Konrad Zuse

"Everything should be made as simple as possible, but not simpler." :~ Albert Einstein
#3
I don't know why the cloud includes irrelevant words. I read that importing the re library and then running
tweets = [re.sub(r'https\S+', '', x) for x in tweets]
would solve it, but it didn't work.
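One way to narrow this down is to run that line in isolation on a few strings. The sketch below does that; the sample tweets are invented for illustration, since the thread never shows the real data.

```python
import re

# Hypothetical sample data; the real tweets are not shown in the thread.
tweets = [
    "Loving the new release https://t.co/abc123",
    "No link in this one",
    "Two links https://example.com/a and https://example.com/b here",
]

# Same substitution as in the post: drop "https" plus the run of
# non-whitespace characters that follows it.
cleaned = [re.sub(r'https\S+', '', x) for x in tweets]

for before, after in zip(tweets, cleaned):
    print(repr(before), '->', repr(after))
```

If the links disappear here but "https" still shows up in the cloud, the cause is probably elsewhere. For example, the cloud may be generated from a different variable than the cleaned list, or tweets may not be a list of plain strings. Both are guesses without the full code.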
Yoriz wrote Jan-11-2023, 11:36 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
#4
Ah, right, so it's not the import that doesn't work; it's the sub() pattern matching.
Is there no way to extract just the text from a tweet, rather than including the URL only to have to filter it out? As it stands, it seems you're creating an issue that then needs to be solved, but the code snippet is too short to fully understand your methodology.

There's a saying that goes something like, when you use regex to solve a problem, you now have two problems to solve.

edit to add:

If there's no other way, I'd try something like this:

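import re

pattern = r'https\S+'  # raw string, so \S reaches the regex engine unchanged
repl = ''
# re.sub() expects a string, so apply it per tweet (assumes tweets is a list of strings)
result = [re.sub(pattern, repl, tweet) for tweet in tweets]

print(result)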
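If stray fragments such as "https" or "co" still end up in the cloud, a rough sketch like the one below combines the URL removal with a small stopword set and counts what is left. The function name word_counts, the EXTRA_STOPWORDS set, and the sample tweets are all made up for illustration; none of them come from the original code.

```python
import re
from collections import Counter

# Matches http:// and https:// links, in any letter case.
URL_RE = re.compile(r'https?://\S+', re.IGNORECASE)
# Very simple tokenizer: runs of lowercase letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")

# Hypothetical extra stopwords; adjust to whatever is irrelevant in your data.
EXTRA_STOPWORDS = {"https", "http", "co", "rt", "amp"}

def word_counts(tweets, stopwords=EXTRA_STOPWORDS):
    """Strip URLs, lowercase, tokenize, and count the non-stopword tokens."""
    counts = Counter()
    for tweet in tweets:
        text = URL_RE.sub('', tweet).lower()
        counts.update(w for w in WORD_RE.findall(text) if w not in stopwords)
    return counts

# Invented sample input.
sample = [
    "RT Python is great https://t.co/xyz",
    "python regex tips HTTP://example.com/page",
]
counts = word_counts(sample)
print(counts.most_common(3))
```

If the cloud is built with the wordcloud package (an assumption, since the thread doesn't name the library), a mapping like counts can be passed to WordCloud.generate_from_frequencies, which gives you full control over which words reach the cloud.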