Python Forum
str.find() not returning correct index.
#1
Hopefully I'm not being stupid here.
I am searching through a MacOS executable for a certain string (base64 encoded) like so:
with open(self.dir, 'r', encoding="ascii", errors="ignore") as f:
    t = f.read()
    if(b64link in t):
        location = t.find(b64link)
    elif(link in t):
        location = t.find(link)
    else:
        pass
If I open the executable in a hex-editor and search for the string manually, I find that the string is at about lines 5220996 to 5221048. However if I run the above python code, it gets me an index of 2944053 which is totally off. If I open the file, seek to that location and read the length of the string, I get:
Output:
vv=>2q=G>2waQ$S y$S y$? v;<s>2t=ek>6h6}E&U >6}
which, I'm gonna be honest, doesn't look like a base64 string to me. There are absolutely no other places in the file where this string could be found. If there were, the output above would be correct.

Maybe I'm missing something obvious but how come .find() is not returning the correct index?

I've been messing around with it. It seems that no matter what I search for, everything comes up in the wrong place. When I print the contents of 't' to the console, I am able to search for the string. Yet when I use index() it just gives total junk.
Reply
#2
From the testing I have done, and the fact that that testing did this to my console:
[Image: DkKQdzK.png]
The only thing I can put it down to is like unsupported characters or something, that is breaking the search. I mean it's a big stretch but there really can't be any other reason.
Reply
#3
Is there a way you could upload a sample file that shows what you're looking at? Perhaps only a few lines are necessary and could be uploaded to a pastebin.

base64 encoding and "visible in a hex editor" seem like completely different things. If you're just looking for a string that has a particular hex representation, that's not base64 encoding.

This looks like it's binary data. As such, identifying a particular "string" seems odd to me. How do you identify the beginning and the end of the string? By newlines or something else?
Reply
#4
(Aug-17-2020, 07:04 PM)bowlofred Wrote: base64 encoding and "visible in a hex editor" seem like completely different things. If you're just looking for a string that has a particular hex representation, that's not base64 encoding.

This looks like it's binary data. As such, identifying a particular "string" seems odd to me. How do you identify the beginning and the end of the string? By newlines or something else?

Sorry, I probably should have explained it better. When I say hex editor I mean basically an extended text editor which shows hex values and line numbers. It can also change encoding, endianness and whatnot. I am using Hex Fiend.
If I open up the Mach-O file with an encoding of ASCII, I will be able to find full base64 encoded strings like dGhpc2lzYmFzZTY0. The hex editor is there purely so I can view these strings. The same can be done in a disassembler like Binary Ninja.

The reason I need to look for this string is so I can replace it with my own. At the moment, I only need to replace one string, but in the future I want to be able to replace any string with any other string. I thought the easiest way to do this would be to search through the file for the string, to make sure it is there, and also to get the index of the string so I can use it later. The reason I can't just use replace() is that some of those strings are base64 encoded and some are not.

The way I identify it is by looking for that whole string in the text. Take: The quick brown fox jumps over the lazy dog. If I want the word 'fox' I can just search for it in the string (using index()). The same can be applied for searching for a base64 string in the ASCII data I have.

Here's a pastebin of the ASCII encoded data. This is what I have in Python, since I open the file with an encoding of ASCII. The exact string I'm searching for is: aHR0cDovL3d3dy5ib29tbGluZ3MuY29tL2RhdGFiYXNlL2Rvd25sb2FkR0pMZXZlbDIyLnBocA==
Reply
#5
Hmm. I'm not sure what's going on then. I wonder if the file is in some odd encoding that your editor is handling automatically?

If you're looking for the "aHR..." string, then your python program (with a couple tiny updates) works for me. I've saved your upload in a text file.

Output:
$ cat ascii.txt -----ASCII ENCODED DATA------- =aHR0cDovL3d3dy5ib29tbGluZ3MuY29tL2RhdGFiYXNlL2dldEdKTGV2ZWxzMjEucGhw#lvl_dataaHR0cDovL3d3dy5ib29tbGluZ3MuY29tL2RhdGFiYXNlL2dldFNhdmVEYXRhLnBocA==&page=%i&secret=%saHR0cDovL3d3dy5ib29tbGluZ3MuY29tL2RhdGFiYXNlL2dldEdKTWFwUGFja3MyMS5waHA=pack_%igauntlet_%iget_gauntlets&secret=%saHR0cDovL3d3dy5ib29tbGluZ3MuY29tL2RhdGFiYXNlL2dldEdKR2F1bnRsZXRzMjEucGhw&gauntlet=%i_%i&levelID=%i&inc=%i&extras=%i&secret=%s&rs=%i%i%s%i%s%i%s&chk=aHR0cDovL3d3dy5ib29tbGluZ3MuY29tL2RhdGFiYXNlL2Rvd25sb2FkR0pMZXZlbDIyLnBocA==%i,%i,%i,%i,%i,%i,%i,%i&levelID=%i&gameVersion=%i&secret=%sgeometry.ach.rateDiff&levelID=%i&stars=%i&secret=%ssg6pUrt0J58281aHR0cDovL3d3dy5ib29tbGluZ3MuY29tL2RhdGFiYXNlL3JhdGVHSlN0YXJzMjExLnBocA==1128989
And then this program:

b64link = "aHR0cDovL3d3dy5ib29tbGluZ3MuY29tL2RhdGFiYXNlL2Rvd25sb2FkR0pMZXZlbDIyLnBocA=="
with open("ascii.txt", 'r', encoding="ascii") as f:
    t = f.read()
    if(b64link in t):
        location = t.find(b64link)
        print(location)
Generates this output:
Output:
456
That's zero-indexed. If I open the text file in vi and go to 457, that puts me right on the string.

Obviously we could be trying to decode the base64 bits, but that doesn't seem to be what you're trying to do.
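For reference, decoding "the base64 bits" needs only the standard library. A minimal sketch, assuming the string from the thread is plain standard base64 (the variable names are just for illustration):

```python
import base64

# The exact string the original poster is searching for.
b64link = "aHR0cDovL3d3dy5ib29tbGluZ3MuY29tL2RhdGFiYXNlL2Rvd25sb2FkR0pMZXZlbDIyLnBocA=="

# b64decode returns bytes; the payload here happens to be an ASCII URL.
decoded = base64.b64decode(b64link)
print(decoded)

# Re-encoding the decoded bytes gives back the original text, which is the
# form that has to be searched for inside the executable.
reencoded = base64.b64encode(decoded).decode("ascii")
print(reencoded == b64link)
```

Searching still has to use the encoded form, since that is what is stored in the file.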
Reply
#6
(Aug-18-2020, 04:44 AM)bowlofred Wrote: Hmm. I'm not sure what's going on then. I wonder if the file is in some odd encoding that your editor is handling automatically?
When I opened the file in Hex Fiend I made sure I had the encoding set to ASCII. The only difference is that the ASCII is "strict 7 bit" rather than just "ascii". When the file is converted there are obviously going to be characters that shouldn't be there, like some sort of unsupported character. The editor will remove these or replace them with blanks, but Python probably doesn't, and that seems like the most logical explanation.

(Aug-18-2020, 04:44 AM)bowlofred Wrote: If you're looking for the "aHR..." string, then your python program (with a couple tiny updates) works for me. I've saved your upload in a text file.
The only difference between yours and mine is that I have the errors="ignore" otherwise it will throw an error like:
Error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcf in position 0: ordinal not in range(128)
when decoding. This wouldn't happen with the provided text because that is in a much bigger block of text that has no characters that would cause a decoding error.

The last thing I tried was:
def getLocation(self):
    location = None

    link = self.link
    b64link = base64.b64encode(link.encode()).decode()

    b64link = self.stringXEscape(b64link.decode("ascii")).encode()
    link = self.stringXEscape(link).encode()

    with open(self.dir, 'rb') as f:
        t = f.read()
        if(b64link in t):
            location = t.index(b64link)
        elif(link in t):
            location = t.index(link)
        else:
            pass

    return location #returns integer location of the string specified in the specified file

def stringXEscape(self, string):
    return "{}".format(''.join(['\\x{:02x}'.format(ord(str(c))) for c in string]))
by comparing the raw Python hex string, but it doesn't work. It doesn't seem to find the string, so it never makes it past the 'if'. I've also looked for the string myself and I can't find it in the text either. I don't even know which line it would be on, so I can't see the difference between my string and the one I am looking for.
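One likely reason the escaped search never matches: formatting '\\x{:02x}' at runtime produces the literal characters backslash, x and two hex digits, not the byte that a \x66 escape stands for in source code. A small sketch of that effect, using a standalone copy of the stringXEscape logic (string_x_escape and haystack are illustrative names, not from the original program):

```python
def string_x_escape(string):
    # Same formatting as stringXEscape in the post, as a plain function.
    return "".join("\\x{:02x}".format(ord(c)) for c in string)

# Four characters per input character: backslash, 'x', two hex digits.
escaped = string_x_escape("fox").encode()
print(escaped)
print(len(escaped))

haystack = b"The quick brown fox jumps over the lazy dog"

# The escaped text does not occur in the data...
print(escaped in haystack)

# ...but the plainly encoded bytes do.
plain = "fox".encode("ascii")
print(plain in haystack, haystack.index(plain))
```

So for a binary search no escaping step should be needed; encoding the string to bytes is enough.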

I'll keep thinking of other ways of finding the string. It might even need to be something stupid like compressing everything with gzip and working with every string in its gzip-compressed form, until the very end where it can be decompressed.

EDIT: I hadn't posted this yet because I had one last idea: look at the characters that come before the string that Python thinks is correct, to see if any were "broken" characters. I couldn't find anything. I then thought about trying to open the file with strict 7 bit ascii encoding but I couldn't find how to do that. I ended up trying out latin-1 encoding and it actually works (even without ignoring errors). I get 5521001 which I believe is the correct index of the string.
Latin-1 is apparently a sort of extended ascii, and when I was looking for those "broken" characters I actually had to use an extended ascii table.
It still doesn't really make sense why it wasn't finding the correct index originally though.
The index returned is still slightly off but only by a few tens of characters. It seems like they're null terminated so that could be why.
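A possible explanation for the original mismatch, sketched on made-up data: with encoding="ascii" and errors="ignore", every byte of 0x80 or above is dropped while decoding, so an index into the decoded string is smaller than the byte offset in the file by the number of bytes dropped before the match. Latin-1 maps every byte to exactly one character, so its indices line up with byte offsets. The sample bytes below are invented for illustration and are not taken from the real executable:

```python
# Synthetic stand-in for the executable: some non-ASCII bytes, then the target.
data = b"\xcf\xfa\xed\xfe" + b"\x00" * 4 + b"\xff\x80junk" + b"dGhpc2lzYmFzZTY0" + b"\x00"
needle = "dGhpc2lzYmFzZTY0"

ascii_text = data.decode("ascii", errors="ignore")  # bytes >= 0x80 silently dropped
latin1_text = data.decode("latin-1")                # exactly one character per byte

ascii_index = ascii_text.find(needle)
latin1_index = latin1_text.find(needle)
byte_index = data.find(needle.encode("ascii"))      # true offset in the raw bytes

# Count the bytes that the ASCII decode threw away before the match.
dropped = sum(1 for b in data[:byte_index] if b >= 0x80)

print(ascii_index, latin1_index, byte_index, dropped)
print(ascii_index + dropped == byte_index)
```

Text mode can also translate "\r\n" into "\n" while reading, which is another way string indices can drift from byte offsets, and one more reason to search the raw bytes.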
Reply
#7
I keep coming here to post an interesting discovery and then keep realising I've been an idiot.

I said that it was a few tens of characters off, which it was. I first tried using a substring rather than seeking to the index, and it didn't change anything. I then thought about using a substring with the file opened in ascii encoding (like at the beginning), not latin-1. Well, this worked, and I was in the middle of typing a reply when I realised that I open the file twice and I've only ever been changing the encoding on one of them. The other has always been ascii, which is what was causing the string to be a few characters off. It is now all working, which is amazing.
Reply
#8
Rather than open the file in ascii and ignore errors, I would prefer opening the file in binary and searching for the binary/byte string instead.
Reply
#9
(Aug-18-2020, 03:16 PM)bowlofred Wrote: Rather than open the file in ascii and ignore errors, I would prefer opening the file in binary and searching for the binary/byte string instead.
I like that idea more, and that was my original route but I couldn't get it to work.

----

I decided to try again and I have managed to also make it work. I am going to stick with using hex because it doesn't rely on any encoding schemes.
Reply
#10
I can't confirm this will work, since I don't have your original (non-ascii) file, but this code does the reads/compares in binary and works on the ascii snippet you gave.

b64link = b"aHR0cDovL3d3dy5ib29tbGluZ3MuY29tL2RhdGFiYXNlL2Rvd25sb2FkR0pMZXZlbDIyLnBocA=="
with open("ascii.txt", 'rb') as f:
    t = f.read()
    if(b64link in t):
        location = t.find(b64link)
        print(location)
Output:
456
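Since the stated goal is to replace the string once it has been located, here is a rough sketch of doing that on the bytes directly. It assumes the replacement is no longer than the original and pads the remainder with NUL bytes so that every later offset in the file stays the same; whether the target program tolerates that padding is an assumption and has not been tested against the real executable. patch_bytes and the sample data are hypothetical:

```python
def patch_bytes(data: bytes, old: bytes, new: bytes) -> bytes:
    """Overwrite the first occurrence of old with new, keeping the length unchanged."""
    if len(new) > len(old):
        raise ValueError("replacement is longer than the original string")
    location = data.find(old)
    if location == -1:
        raise ValueError("original string not found")
    # Assumption: the string is NUL-terminated, so NUL padding is harmless.
    padded = new + b"\x00" * (len(old) - len(new))
    return data[:location] + padded + data[location + len(old):]

# Invented sample data standing in for the executable's contents.
sample = b"\xcf\xfa\xed\xfeheader\x00dGhpc2lzYmFzZTY0\x00trailer"
patched = patch_bytes(sample, b"dGhpc2lzYmFzZTY0", b"bmV3")

print(len(patched) == len(sample))
print(patched.find(b"bmV3"), sample.find(b"dGhpc2lzYmFzZTY0"))
```

On a real file, the data would be read with open(path, 'rb') and the result written back with open(path, 'wb'), ideally to a copy first.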
Reply

