Python Forum
extract only text strip byte array
#1
First off, awesome forum. You guys have been super helpful to a noob, without guilting me into "reading the manual". This is how I learn, and I'm glad you are helping me learn.

I have various byte strings with text in them. The bytes are always at the beginning of the string, and the text always comes after them. Below are a few examples:

b'\x01\x80Q\x01\x00\x01\x12\x00\x00\x80Test Message 12345'
b'7\x00\x00\x00You find many blocks, can you send me some coin please!'
b'\x01T\x00\x00\x80{"market":"2","rate":"2600","account":"ltc1q3la84qedyf2745y8cel4wklgqalsvh8xjk7g2k"}'



I need to start from the back of the message, which could be as long as 1000 characters, and grab all of the text until I hit the byte array (the first \), then strip the bytes in front of it, leaving only the text behind. So the three lines above should return this:

Test Message 12345
You find many blocks, can you send me some coin please!
{"market":"2","rate":"2600","account":"ltc1q3la84qedyf2745y8cel4wklgqalsvh8xjk7g2k"}


Any help would be great! The text message can be in other charsets and different languages, so I'm not sure how to do this accurately, since the bytes ahead of the message vary in length.
#2
Or if someone has a better way to do this... I'm a bit stuck. Maybe even split at the end of the bytes and keep only the rest? I'm not sure how to do that either with a random-sized, random byte prefix.
#3
Try this:
msgs = [
    b'\x01\x80Q\x01\x00\x01\x12\x00\x00\x80Test Message 12345',
    b'7\x00\x00\x00You find many blocks, can you send me some coin please!',
    b'\x01T\x00\x00\x80{"market":"2","rate":"2600","account":"ltc1q3la84qedyf2745y8cel4wklgqalsvh8xjk7g2k"}'
]

for msg in msgs:
    txt = str(msg).split('\\')[-1][3:-1]
    print(txt)
results:
Output:
Test Message 12345
You find many blocks, can you send me some coin please!
{"market":"2","rate":"2600","account":"ltc1q3la84qedyf2745y8cel4wklgqalsvh8xjk7g2k"}
#4
(Nov-29-2022, 06:12 PM)Larz60+ Wrote: Try this:

Is it that easy? Man, I'm the worst at splits. Can you talk me through this? .split('\\')[-1][3:-1] looks like we split at \ then go back 3. What is the -1?
#5
Your idea of stripping off non-printable characters is dangerous. The example below starts with bytes 0x01, 0x80, 0x51, 0x01. The danger is that 0x51 is the value of "Q". This "Q" is surrounded by other non-printable bytes, but what if it was just before "Test"? Unless you know what those leading bytes mean, you have to assume that potentially all the bytes could be printable and there won't be any "\" to look for.
Output:
b'\x01\x80Q\x01\x00\x01\x12\x00\x00\x80Test Message 12345'
I would do some research to find out what the leading bytes mean. The values probably tell you where the text starts. How did you get these bytes objects?
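To make the danger concrete, here is a small sketch that wraps the split-on-backslash one-liner from post #3 in a function and feeds it two made-up messages. The function name `strip_prefix_via_repr` and both failing messages are hypothetical and not taken from the real data; they only illustrate how a printable prefix byte or non-ASCII text could slip through.

```python
def strip_prefix_via_repr(msg: bytes) -> str:
    """The split-on-backslash approach from post #3, wrapped in a function."""
    return str(msg).split('\\')[-1][3:-1]


# Works: the last prefix byte is displayed as a \xNN escape.
ok = strip_prefix_via_repr(b'7\x00\x00\x00You find many blocks')

# Hypothetical message: the last prefix byte is 0x51, which displays as "Q".
leaked = strip_prefix_via_repr(b'\x01\x80\x00QTest Message 12345')

# Hypothetical message: non-ASCII text is itself displayed as \xNN escapes,
# so the "last backslash" lands in the middle of the text.
mangled = strip_prefix_via_repr(b'7\x00\x00\x00' + 'caf\u00e9 au lait'.encode('utf-8'))

print(repr(ok))       # 'You find many blocks'
print(repr(leaked))   # 'QTest Message 12345'
print(repr(mangled))  # ' au lait'
```

The second result keeps a stray "Q" from the prefix, and the third loses the start of the text, which is why knowing the layout of the prefix matters.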

I did notice that the "prefix" bytes contain the length of the message.
b'\x01\x80Q\x01\x00\x01\x12\x00\x00\x80Test Message 12345' Message Length = b'\x12' (0x12 = 18)
b'7\x00\x00\x00You find many blocks, can you send me some coin please!' Message Length = b'7' (0x37 = 55)
b'\x01T\x00\x00\x80{"market":"2","rate":"2600","account":"ltc1q3la84qedyf2745y8cel4wklgqalsvh8xjk7g2k"}' Message Length = b'T' (0x54 = 84)
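If the length byte is reliable, it could be used to find the text instead of looking for backslashes. The sketch below is a guess based only on the three samples, not on any spec: it assumes the text is preceded by a 4-byte field whose first 3 bytes hold the text length in bytes (little-endian) and whose 4th byte is a flag or padding. The function name `extract_text` and the `encoding` parameter are made up for illustration, and UTF-8 is assumed for the text.

```python
msgs = [
    b'\x01\x80Q\x01\x00\x01\x12\x00\x00\x80Test Message 12345',
    b'7\x00\x00\x00You find many blocks, can you send me some coin please!',
    b'\x01T\x00\x00\x80{"market":"2","rate":"2600","account":"ltc1q3la84qedyf2745y8cel4wklgqalsvh8xjk7g2k"}',
]


def extract_text(msg: bytes, encoding: str = "utf-8") -> str:
    """Return the trailing text, located via a length field in the prefix.

    ASSUMPTION (inferred from three samples only): the text is preceded by
    a 4-byte field; bytes 0-2 are the text length in bytes, little-endian,
    and byte 3 is a flag or padding.
    """
    for i in range(len(msg) - 3):
        length = int.from_bytes(msg[i:i + 3], "little")
        text_start = i + 4
        # Accept the first position whose length field matches the number
        # of bytes remaining after the 4-byte field.
        if length == len(msg) - text_start:
            return msg[text_start:].decode(encoding)
    raise ValueError("no length field matching the payload size was found")


for msg in msgs:
    print(extract_text(msg))
```

This takes the first position that fits, so an unlucky prefix could still produce a false match. If the real format documents where the length field sits, reading it from that fixed position would be safer than searching for it.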
#6
There is only one split here. This creates a list of str objects by splitting the string at each backslash.
A = str(msg)
B = A.split('\\')
This is indexing. This gets the last str in the list.
C = B[-1]
This is a slice. It starts at index 3 and stops just before the last character, so it drops the first three characters (the "x80" or "x00" left over from the final escape) and the closing quote.
txt = C[3:-1]
Being "the worst at splits" is a temporary condition. Now that you know you are really "the worst at slices", and maybe just a little bad at indexing, you can cure that by reading about slicing here.
https://www.askpython.com/python/array/a...-in-python

You really should read about slicing and play around with slicing until you know how it works. It is very useful and your programming will suffer until you know it forward, backward and inside out.
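A quick way to play with it is to print each intermediate value for one of the sample messages. This is just the one-liner from post #3 broken into the same steps as above, nothing new:

```python
msg = b'7\x00\x00\x00You find many blocks, can you send me some coin please!'

A = str(msg)       # the text Python displays for the bytes, including b' and the closing '
B = A.split('\\')  # list of the pieces between backslashes
C = B[-1]          # last piece: the end of the final escape plus the text and closing quote
txt = C[3:-1]      # drop the first 3 characters ("x00") and the closing quote

print(B)    # ["b'7", 'x00', 'x00', "x00You find many blocks, can you send me some coin please!'"]
print(C)    # x00You find many blocks, can you send me some coin please!'
print(txt)  # You find many blocks, can you send me some coin please!
```

Changing the slice to C[3:] or C[:-1] and printing the result shows what each end of the slice is responsible for.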
#7
(Nov-29-2022, 08:10 PM)deanhystad Wrote: I would do some research to find out what the leading bytes mean. The values probably tell you where the text starts. How did you get these bytes objects?

The other bytes are info on what type of transaction this is, etc. I'm pulling that off in other parts of the script and using it. I'm just too new and for some reason couldn't figure this out, but I'm getting there, slowly. :D
#8
(Nov-29-2022, 08:34 PM)deanhystad Wrote: There is only one split here. This creates a list of str objects by splitting the string at each backslash.

I really appreciate you taking the time to explain each piece to me; this really helps. I'm really bad at just "reading the manual". It often does me more harm than trying to reverse engineer someone else's code. But explaining real-world pieces like you did is awesome, thanks!