Apr-01-2019, 02:12 PM
(This post was last modified: Apr-01-2019, 02:13 PM by CaptainCsaba.)
Hi!
I have a strange problem. I have a PDF that I convert into a string using PDFminer. I replace every "\n" with an empty string (because line breaks sometimes appear in places that break the matching), then I search for a substring and modify it a bit to get what I need. It works perfectly in most cases, but in some situations it does not, and I don't know why. This is the code:
    text1line = str(text).replace("\n", "")
    designation1 = str(re.search('in the name of(.*)Voting', text1line))
    designation0 = re.sub('<re.*name of ', '', designation1)
    designation = str(designation0).split("Limited")[1]

This is a part of "text1line": "registered in the name of XYZ Limited OK04.Voting rights are"
What I need is just "OK04". This is what designation1 is in this example:
<re.Match object; span=(551, 608), match='in the name of State Street Nominees Limited OK0>
For some reason, in some cases it misses the last character, and I have absolutely no idea why (the ".Vot" part is missing too, but that one is not needed). The PDFs can vary a bit in format, and I have not yet figured out which formats go wrong.
What is the problem?
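One possible explanation, offered as a sketch rather than a confirmed diagnosis: calling str() on a match object returns its repr, and CPython shortens the match text shown in that repr to roughly 50 characters. That would explain why longer company names lose the tail ("OK0" instead of "OK04") while short ones come out fine. Reading the captured group directly with .group(1) avoids the repr entirely. The function name extract_designation and the non-greedy (.*?) pattern below are my own assumptions for illustration, not part of the original code.

```python
import re


def extract_designation(text1line):
    """Return the code that follows "Limited", or None if nothing matches.

    Hypothetical helper: the name and the None fallback are illustrative.
    """
    # Non-greedy (.*?) stops at the first "Voting" after the phrase.
    match = re.search(r"in the name of(.*?)Voting", text1line)
    if match is None:
        return None
    # group(1) is the full captured text; it is never truncated,
    # unlike str(match), which only shows a shortened preview.
    captured = match.group(1)
    _, separator, tail = captured.partition("Limited")
    if not separator:
        return None
    # Drop surrounding spaces and the trailing full stop.
    return tail.strip(" .")


sample = "registered in the name of XYZ Limited OK04.Voting rights are"
print(extract_designation(sample))
```

Because the value comes from the match itself and not from its printed form, the length of the company name should no longer matter.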