Python Forum
extract only text strip byte array - Printable Version


extract only text strip byte array - Pir8Radio - Nov-29-2022

First off, awesome forum. You guys have been super helpful to a noob, without guilting me into "reading the manual". This is how I learn, and I'm glad you are helping me learn.

I have various byte strings with text in them. The binary bytes are always at the beginning of the string and the text always comes after them. Below are a few examples:

b'\x01\x80Q\x01\x00\x01\x12\x00\x00\x80Test Message 12345'
b'7\x00\x00\x00You find many blocks, can you send me some coin please!'
b'\x01T\x00\x00\x80{"market":"2","rate":"2600","account":"ltc1q3la84qedyf2745y8cel4wklgqalsvh8xjk7g2k"}'



I need to start from the back of the message, which could be as much as 1000 characters, and grab all of the text until I hit the byte array or the first \, then strip the leading byte characters, leaving only the text behind. So the three lines above should return this:

Test Message 12345
You find many blocks, can you send me some coin please!
{"market":"2","rate":"2600","account":"ltc1q3la84qedyf2745y8cel4wklgqalsvh8xjk7g2k"}


Any help would be great! The text can be in other charsets and different languages, so I'm not too sure how to do this accurately, since the bytes ahead of the message are of various lengths.


RE: extract only text strip byte array - Pir8Radio - Nov-29-2022

Or if someone has a better way to do this... I'm a bit stuck. Maybe even split where the bytes end and keep only the rest? I'm not sure how to do that either with a random-sized, random byte prefix.


RE: extract only text strip byte array - Larz60+ - Nov-29-2022

Try this:
msgs = [
    b'\x01\x80Q\x01\x00\x01\x12\x00\x00\x80Test Message 12345',
    b'7\x00\x00\x00You find many blocks, can you send me some coin please!',
    b'\x01T\x00\x00\x80{"market":"2","rate":"2600","account":"ltc1q3la84qedyf2745y8cel4wklgqalsvh8xjk7g2k"}'
]

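for msg in msgs:
    # str(msg) gives the printable repr of the bytes, backslash escapes included.
    # Keep what follows the last backslash, then drop the 3 leftover escape
    # characters (such as "x80") and the closing quote.
    txt = str(msg).split('\\')[-1][3:-1]
    print(txt)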
Results:
Output:
Test Message 12345
You find many blocks, can you send me some coin please!
{"market":"2","rate":"2600","account":"ltc1q3la84qedyf2745y8cel4wklgqalsvh8xjk7g2k"}



RE: extract only text strip byte array - Pir8Radio - Nov-29-2022

(Nov-29-2022, 06:12 PM)Larz60+ Wrote: Try this:

Is it that easy? Man, I'm the worst at splits. Can you talk through this? .split('\\')[-1][3:-1] looks like we split at \ then go back 3, but what is the -1?


RE: extract only text strip byte array - deanhystad - Nov-29-2022

Your idea of stripping off non-printable characters is dangerous. The example below starts with bytes 0x01, 0x80, 0x51, 0x01. The danger is that 0x51 is the value of "Q". This "Q" is surrounded by non-printable bytes, but what if it was just before "Test"? Unless you know what those leading bytes mean, you have to assume that potentially all of them could be printable, and then there won't be any "\" to look for.
Output:
b'\x01\x80Q\x01\x00\x01\x12\x00\x00\x80Test Message 12345'
I would do some research to find out what the leading bytes mean. The values probably tell you where the text starts. How did you get these bytes objects?
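
To make the risk concrete, here is a small sketch that runs the repr-and-split one-liner from earlier in the thread on two made-up messages. Both inputs and the helper name strip_prefix_via_repr are hypothetical and only for illustration. The first has a printable byte right before the text, the second carries non-ASCII text encoded as UTF-8.

```python
def strip_prefix_via_repr(msg: bytes) -> str:
    # The one-liner posted earlier in this thread, wrapped in a function.
    return str(msg).split('\\')[-1][3:-1]

# Hypothetical message: the last prefix byte is 0x51, which prints as "Q".
leaky = b'\x01\x00QTest'
print(strip_prefix_via_repr(leaky))      # the Q leaks into the text: QTest

# Hypothetical message: the text itself contains non-ASCII bytes, so its
# repr has backslashes inside the text and the split cuts it short.
accented = b'\x01\x00' + 'héllo'.encode('utf-8')
print(strip_prefix_via_repr(accented))   # only the tail survives: llo
```

So the trick holds for the three samples but not for arbitrary prefixes or for text in other languages.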

I did notice that the "prefix" bytes contain the length of the message (shown in decimal in parentheses).
b'\x01\x80Q\x01\x00\x01\x12\x00\x00\x80Test Message 12345' Message Length = b'\x12' (0x12 = 18)
b'7\x00\x00\x00You find many blocks, can you send me some coin please!' Message Length = b'7' (0x37 = 55)
b'\x01T\x00\x00\x80{"market":"2","rate":"2600","account":"ltc1q3la84qedyf2745y8cel4wklgqalsvh8xjk7g2k"}' Message Length = b'T' (0x54 = 84)
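
A minimal sketch of how that length byte could be used instead of looking for backslashes. It assumes, purely as a guess from the three samples and not from any documented format, that the four bytes just before the text form a little-endian 32-bit length whose top bit is a flag. The function name extract_text and the utf-8 default are made up for illustration.

```python
import struct

def extract_text(data: bytes, encoding: str = "utf-8") -> str:
    """Return the trailing text, located via an assumed 4-byte length field."""
    for start in range(len(data) - 3):
        (raw,) = struct.unpack_from("<I", data, start)
        length = raw & 0x7FFFFFFF  # assumption: the top bit is a flag, not length
        # Accept the first field whose length covers exactly the remaining bytes.
        if length > 0 and start + 4 + length == len(data):
            return data[start + 4:].decode(encoding)
    raise ValueError("no length field matching the remaining bytes was found")

msgs = [
    b'\x01\x80Q\x01\x00\x01\x12\x00\x00\x80Test Message 12345',
    b'7\x00\x00\x00You find many blocks, can you send me some coin please!',
    b'\x01T\x00\x00\x80{"market":"2","rate":"2600","account":"ltc1q3la84qedyf2745y8cel4wklgqalsvh8xjk7g2k"}',
]

for msg in msgs:
    print(extract_text(msg))
```

Because this works on the raw bytes rather than their repr, a printable prefix byte such as the Q no longer matters. A coincidental earlier match is still possible, so knowing the real layout of the prefix remains the safer route.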


RE: extract only text strip byte array - deanhystad - Nov-29-2022

There is only one split here. str(msg) gives the printable representation of the bytes object, and split creates a list of str objects by splitting that string at each backslash.
A = str(msg)
B = A.split('\\')
This is indexing. This gets the last str in the list.
C = B[-1]
This is a slice. It starts at index 3 and stops just before the last character, so it drops the three characters left over from the final escape (such as x80) and the closing quote.
txt = C[3:-1]
Being "the worst at splits" is a temporary condition. Now that you know you are really "the worst at slices", and maybe just a little bad at indexing, you can cure that by reading about slicing here.
https://www.askpython.com/python/array/array-slicing-in-python

You really should read about slicing and play around with slicing until you know how it works. It is very useful and your programming will suffer until you know it forward, backward and inside out.
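
As a quick way to play with it, the split, index and slice steps described above can be traced on the first sample message. The variable names A, B, C and txt follow the explanation. The printed values are what those steps produce for this particular input.

```python
msg = b'\x01\x80Q\x01\x00\x01\x12\x00\x00\x80Test Message 12345'

A = str(msg)        # the repr of the bytes object, including the b'...' wrapper
B = A.split('\\')   # split at every literal backslash in that repr
C = B[-1]           # indexing: the last piece of the list
txt = C[3:-1]       # slice: skip the first 3 characters, stop before the last one

print(C)            # x80Test Message 12345'
print(txt)          # Test Message 12345
```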


RE: extract only text strip byte array - Pir8Radio - Nov-29-2022

(Nov-29-2022, 08:10 PM)deanhystad Wrote: I would do some research to find out what the leading bytes mean. The values probably tell you where the text starts. How did you get these bytes objects?

The other bytes are info on what type of transaction this is, etc. I'm pulling that off in other parts of the script and using it. I'm just too new and for some reason couldn't figure this out, but I'm getting there, slowly. :D


RE: extract only text strip byte array - Pir8Radio - Nov-29-2022

(Nov-29-2022, 08:34 PM)deanhystad Wrote: There is only one split here. This creates a list of str objects by splitting the string at each backslash.

I really appreciate you taking the time to explain each piece to me, this really helps. I'm really bad at just "reading the manual"; it often does me more harm than trying to reverse engineer someone else's code. But explaining real-world pieces like you did is awesome, thanks!