Python Forum
Cannot Remove the Double Quotes on a Certain Word (String) Python BeautifulSoup
#1
Hi guys,

How's it going?

I've spent weeks trying to remove a double quote (") from a word (I want to count the words in a given text or webpage).

I've been going back and forth, converting the variable to a list or a string in order to use their methods.

Here's my code:

import requests
from bs4 import BeautifulSoup

url = 'https://burniva.com/sam-smith-weight-loss/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

articles = soup.find('div', class_="cmsmasters_text")
paragraphs = articles.find_all('p')

new_p = []
for p in paragraphs:
    new_p.append(p.get_text())

for p in new_p:
    print(str(p).lstrip('"'))
There would be words like:
“Stay with Me” and “I’m not the Only One”,
"Latch"
“The Thrill of It All”
“Oh! Carol”
etc. etc....

I cannot remove those double quotes. I tried .replace and .strip, and I even tried splitting the text up or turning it into a list. I can remove other special characters such as punctuation, apostrophes, and periods, but never the double quotes.

Please try the code and you'll get the whole article as output, including the titles I mentioned.

I hope someone can help, as I am now very curious about what I am missing and why I cannot remove them to get a clean list of just words.

Thanks in advance!
#2
The replace method is what you want. I expect the problem you are having is that not all of the examples you showed use ASCII quotes; some of them use smart quotes. So you would need to replace three times: once for the ASCII quote, once for the starting smart quote, and once for the ending smart quote.
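
As a minimal sketch of those three replacements in Python 3, assuming the text is already a str (the sample string and the helper name strip_double_quotes are made up for illustration), the smart quotes can be written as Unicode escapes so they are easy to tell apart:

```python
# Hypothetical helper: remove the ASCII double quote and both smart double quotes.
def strip_double_quotes(text):
    return (text.replace('"', '')        # ASCII quotation mark, U+0022
                .replace('\u201c', '')   # left (opening) smart quote, U+201C
                .replace('\u201d', ''))  # right (closing) smart quote, U+201D

sample = '\u201cStay with Me\u201d and "Latch"'
print(strip_double_quotes(sample))  # Stay with Me and Latch
```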

I found these in some old code I used for scraping Facebook data. I think they are the single and double smart quotes:

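# UTF-8 byte sequences for the em dash, single smart quotes, and double smart quotes.
# They were written for byte strings; on a Python 3 str, match the characters themselves.
REPLACEMENTS = (('\xe2\x80\x94', '--'), ('\xe2\x80\x99', "'"), ('\xe2\x80\x98', "'"), 
	('\xe2\x80\x9c', '\\"'), ('\xe2\x80\x9d', '\\"'))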
Craig "Ichabod" O'Brien - xenomind.com
#3
(Oct-22-2019, 01:43 PM)soothsayerpg Wrote: I've spent weeks trying to remove a double quote (") from a word (I want to count the words in a given text or webpage).

There would be words like:
“Stay with Me” and “I’m not the Only One”,
"Latch"
[ ... ]

I cannot remove those double quotes. I tried .replace and .strip, and I even tried splitting the text up or turning it into a list. I can remove other special characters such as punctuation, apostrophes, and periods, but never the double quotes.

Hi!

It's just a thought, but I think that maybe it's because you are using different types of double quotes, and probably eliminating just one type of them.

The double quotes in “Stay with Me” are, for instance, different from the double quotes in "Latch".
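
One way to check this, sketched here with the standard unicodedata module (the sample strings are just the titles quoted above), is to print the code point and official name of each quote character. The same sketch shows why .strip('"') leaves the smart quotes alone:

```python
import unicodedata

# Print the code point and Unicode name of each kind of double quote.
for ch in '"\u201c\u201d':
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))

smart = '\u201cStay with Me\u201d'
plain = '"Latch"'

# strip('"') only removes the ASCII quote, so the smart quotes survive.
print(smart.strip('"') == smart)   # True
print(plain.strip('"'))            # Latch
```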

All the best,
#4
Hi guys. Thanks for the response!
Did you try it out? If you run it and remove all the special characters, that double quote is the one that remains.

At one point I thought that maybe I should remove it before converting the text into a list or a string, but I still could not remove it. Just a wild guess, though.

@ichabod801, I will try it out, though the code you shared is a little confusing.
#5
In Python 3, any string without the b (bytes) prefix, as in b'hello', is Unicode text.
So what you see is what you get: you can just copy the smart quotes from the output and replace them.
>>> p = paragraphs[0].text
>>> print(p)
With hits such as “Stay with Me” and “I’m not the Only One”, Sam Smith has become one of UK’s hottest singers.
>>> 
>>> # Now just copy the smart quotes from the line above and replace them
>>> print(p.replace('“', '').replace('”', ''))
With hits such as Stay with Me and I’m not the Only One, Sam Smith has become one of UK’s hottest singers.
Also, using .text is shorter than get_text(); they do the same thing.
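
To tie this back to the byte sequences in post #2, here is a small sketch (standard library only, with a made-up sample string) showing that the same smart quote is one character as a str and three bytes once encoded as UTF-8. That is why a pattern such as '\xe2\x80\x9c' only matches when the data is bytes:

```python
left_quote = '\u201c'                   # the opening smart quote as Unicode text

encoded = left_quote.encode('utf-8')    # the same character as UTF-8 bytes
print(len(left_quote), len(encoded))    # 1 3
print(encoded == b'\xe2\x80\x9c')       # True

# On a str, replace the characters themselves ...
text = '\u201cLatch\u201d'
print(text.replace('\u201c', '').replace('\u201d', ''))  # Latch

# ... and on bytes, replace the byte sequences.
raw = text.encode('utf-8')
cleaned = raw.replace(b'\xe2\x80\x9c', b'').replace(b'\xe2\x80\x9d', b'')
print(cleaned.decode('utf-8'))  # Latch
```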
#6
(Oct-27-2019, 08:31 AM)soothsayerpg Wrote: Did you tried it out? If you try it out and remove all the special chars, that double-quote is the one who'll remain.

string1 = """There would be words like:
“Stay with Me” and “I’m not the Only One”,
"Latch"
“The Thrill of It All”
“Oh! Carol”
etc. etc. ..."""


print(''.join(characters for characters in string1 if characters not in '"“”'))
Output:
There would be words like:
Stay with Me and I’m not the Only One,
Latch
The Thrill of It All
Oh! Carol
etc. etc. ...
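
Since the original goal was to count words, a possible next step, sketched here with str.translate and collections.Counter (the sample text and variable names are assumptions, not taken from the scraped page), is to delete all three quote characters in one pass and then tally the words:

```python
from collections import Counter

# Translation table that deletes the ASCII quote and both smart double quotes.
QUOTE_TABLE = str.maketrans('', '', '"\u201c\u201d')

sample = '\u201cLatch\u201d and "Latch" and \u201cOh! Carol\u201d'
cleaned = sample.translate(QUOTE_TABLE)
print(cleaned)  # Latch and Latch and Oh! Carol

# Count whitespace-separated words in the cleaned text.
counts = Counter(cleaned.split())
print(counts['Latch'], counts['and'])  # 2 2
```

Splitting on whitespace keeps punctuation attached to words (for example "Oh!"), so other punctuation would still need to be stripped separately before counting.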
All the best,