Python Forum
How to remove patterns of characters from text
#1
I'm still learning Python. I need to delete every occurrence of "%" together with the two characters that follow it, which should be alphanumeric, from a string. I've read through the docs for str.replace() and filter() and didn't have any luck. Is there another function that might help me that I'm not finding in my searches?

I'm having a hard time wording my need into a search so I can find the answer myself.

Thanks for any help!
Reply
#2
This is usually percent-encoding (URL encoding), seen a lot in redirected URLs.
For example:
%3A is the hexadecimal representation of ':'
%2F is the hexadecimal representation of '/'
so "=https%3A%2F%2Fnews.ycombinator.com"
would be equivalent to "=https://news.ycombinator.com"

see: https://web.stanford.edu/class/archive/c...-table.png
which is a complete ASCII chart; the hex values in it are what follows each '%'.

Note that once the string is decoded, printing it shows the characters themselves rather than the codes.
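
As a minimal sketch of the decoding described above, the standard library's urllib.parse.unquote turns each %XX sequence back into the character it stands for. The variable names are only illustrative; the sample string is the one from this post.

```python
from urllib.parse import unquote

# The percent-encoded sample from the post above.
encoded = "=https%3A%2F%2Fnews.ycombinator.com"

# unquote() replaces every %XX escape with the byte it encodes and
# interprets the resulting bytes as UTF-8 by default.
decoded = unquote(encoded)

print(decoded)  # =https://news.ycombinator.com
```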
Reply
#3
You're right, I think it was a UTF-8 encoded character that I'm trying to remove from a URL string so I can parse the URL and pull out what I need from it. My problem is that while I know what this one is, I could get other unknown ones in the future. This time it was %E3%80%91, a bold bracket.

If I were to get another in the future, a different one I can't foresee, my thought was to use a function that allows me to remove a pattern, %__, as this would leave me with the parts I need, no matter what other critical data was taken. I'm not quite certain how to do that otherwise.

To give you all the information, I'm pulling ASINs and Amazon product titles from a product URL so I can construct a review URL to scrape from. When the product title has a special character in it, it deforms the product title that I split off, and I can't get to the review page. When the special character's percent-encoded sequence is removed, the URL sends me to the review page again, so its value isn't critical to the call.

Is there a better way than some form of pattern removal to do this?
Reply
#4
see: https://stackoverflow.com/a/12082349
Reply
#5
(Nov-19-2022, 03:15 AM)aaander Wrote: Is there a better way than some form of pattern removal to do this?
This is URL encoding (percent-encoding). Characters that are allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding); anything else has to be percent-encoded.
Example with Requests:
>>> import requests
>>> 
>>> requests.utils.unquote('%E3%80%91')
'】'
>>> requests.utils.quote('】')
'%E3%80%91'

# So if you have, for example, an encoded string like this, use unquote
>>> requests.utils.unquote('test%2Buser%40gmail.com')
'test+user@gmail.com'
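
If the goal really is to drop the escapes rather than decode them, as asked in the first post, a regular expression is one option. This is only a sketch under assumptions: the helper names and the sample title are hypothetical, and the pattern matches "%" followed by exactly two hexadecimal digits, which is slightly narrower than "any two alphanumeric characters".

```python
import re
from urllib.parse import unquote

# "%" followed by exactly two hex digits, e.g. %E3, %80, %2F.
PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def strip_percent_escapes(text):
    """Delete every %XX sequence from text without decoding it."""
    return PERCENT_ESCAPE.sub("", text)


def decode_then_drop_non_ascii(text):
    """Alternative: decode the escapes first, then discard non-ASCII characters.

    Unlike strip_percent_escapes, this keeps ASCII characters that were
    merely encoded (for example %2F becomes "/").
    """
    decoded = unquote(text)
    return decoded.encode("ascii", "ignore").decode("ascii")


# Hypothetical product-title fragment containing the bracket from this thread.
sample = "Some-Product%E3%80%91Title"

print(strip_percent_escapes(sample))       # Some-ProductTitle
print(decode_then_drop_non_ascii(sample))  # Some-ProductTitle
```

The two helpers differ on escapes that encode plain ASCII: for "a%2Fb" the first returns "ab" and the second returns "a/b", so which one fits depends on whether those characters matter for the URL being built.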
Reply

