Python Forum
Negative lookahead not working, help please
#1
Hi everyone,

So here is my problem. I have a bunch of tweets and various metadata that I want to analyze for sociolinguistic purposes. To do this, I'm trying to infer users' ages from, among other things, the information they provide in their bios. For that I'm using regular expressions to match a couple of recurring patterns in users' bios, such as a number followed by various spellings of "years old", as in:

"John, 30 years old, engineer."

The reason I'm using regexes for this is that there are actually very few ways people mention their age on Twitter, so just three or four regexes would let me infer the age of most users in my dataset. However, in this case I also want to check what comes after "years old", as many people mention their children's ages, and I don't want those to be incorrectly associated with the user's age, as in:

"John, father of a 12 year old kid, engineer"


Cases like the one above should be ignored, so that I only keep users for whom a valid age can be inferred.
My program looks like this:
[code]import csv
import re

with open("test_corpus.csv") as corpus:
    corpus_read = csv.reader(corpus, delimiter=",")
    for row in corpus_read:
        if re.findall(r"\d{2}\s?(?=years old\s?|yo\s?|yr old\s?|y o\s?|yrs old\s?|year old\s?(?!son|daughter|kid|child))",row[5].lower()):
            age = re.findall(r"\d{2}\s?",row[5].lower())
            for i in age:
                print(i)[/code]
The program seems to work in some cases, but in the small test file I created to try it out, it incorrectly matches the age mentioned in the string "I have a 12 yo son", and returns 12 as a matched age, which I don't want it to. I'm guessing this has something to do with brackets or delimiters at some point in the program, but I spent a few days on it, and I could not find anything helpful on the forum, so any help would be appreciated.

Thus, the actual question is: how to make the program not recognize 12 in "John, father of a 12 year old kid, engineer" as the age of the user, based on the program I already have?


I am somewhat new at programming, so apologies if I forgot to mention something important, do not hesitate to tell me if you need more details.

Thanks in advance for any help you could provide!
Reply
#2
Regexes have always been a little bit of black magic to me, lookaheads especially so.  So, have you considered dropping the lookaheads entirely, and just trying to group the age AND the kid status?  Then you can test against the existence of the group to see if it was in the original string.

>>> import re
>>> regex = re.compile(r"(\d{2})\s?(years old|yo|yr old|y o|yrs old|year old)\s*(son|daughter|kid|child)?")
>>> test1 = "John, 30 years old, engineer."
>>> test2 = "John, father of a 12 year old kid, engineer"
>>> regex.findall(test1)
[('30', 'years old', '')]
>>> regex.findall(test2)
[('12', 'year old', 'kid')]
>>>
You can see that then, one of those has a value for the third group, and the other is just an empty string.  So you can do:
matches = regex.findall(test)
if not matches[2]:
    # do things with the age
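A minimal runnable sketch of this idea follows. Note that findall returns a list with one tuple per match, so the third group has to be read from each tuple rather than from the list itself. The helper name first_age_without_child is hypothetical, and returning the first age that has no child word after it is just one possible policy.

```python
import re

# Same pattern as in the session above; the third group is optional.
AGE_RE = re.compile(
    r"(\d{2})\s?(years old|yo|yr old|y o|yrs old|year old)\s*(son|daughter|kid|child)?"
)


def first_age_without_child(bio):
    """Return the first two-digit age not followed by a child word, else None.

    Hypothetical helper, not part of the original post.
    """
    # findall yields one (age, phrase, child) tuple per match in the string.
    for age, _phrase, child in AGE_RE.findall(bio.lower()):
        if not child:  # empty string: no child word followed the age phrase
            return age
    return None


print(first_age_without_child("John, 30 years old, engineer."))
print(first_age_without_child("John, father of a 12 year old kid, engineer"))
```

With this policy, a bio that mentions a child's age first and the user's own age afterwards still yields the user's age.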
Reply
#3
(Feb-26-2017, 01:48 PM)MitchBuchanon Wrote: Thus, the actual question is: how to make the program not recognize 12 in "John, father of a 12 year old kid, engineer" as the age of the user, based on the program I already have?


I am somewhat new at programming, so apologies if I forgot to mention something important, do not hesitate to tell me if you need more details.

Thanks in advance for any help you could provide!

You can't do it with one single regexp, because you are trying to exclude sentences containing specific strings at specific places and this cannot be done with regexps. But you can do it with one or more regexps that match what you want to exclude ("father of"...), run that on your sentence, and if it doesn't match, run a second regexp that will extract the age.
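The two-step approach described here could be sketched roughly as follows. Both patterns and the function name extract_age are assumptions made for illustration; the exclusion pattern here targets an age phrase followed by a child word rather than "father of".

```python
import re

AGE_PHRASES = r"(?:years old|yo|yr old|y o|yrs old|year old)"

# Step 1 (assumed pattern): an age phrase directly followed by a child word.
CHILD_AGE_RE = re.compile(r"\d{2}\s?" + AGE_PHRASES + r"\s?(?:son|daughter|kid|child)")
# Step 2 (assumed pattern): any two-digit age followed by an age phrase.
AGE_RE = re.compile(r"(\d{2})\s?" + AGE_PHRASES)


def extract_age(bio):
    """Return the age as a string, or None if excluded or absent."""
    text = bio.lower()
    if CHILD_AGE_RE.search(text):
        return None  # the bio mentions a child's age: skip it entirely
    match = AGE_RE.search(text)
    return match.group(1) if match else None


print(extract_age("John, 30 years old, engineer."))
print(extract_age("John, father of a 12 year old kid, engineer"))
```

The trade-off of this sketch is that a bio mentioning both a child's age and the user's own age is discarded as a whole.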
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
Reply
#4
Thanks a lot for your help and thoughts guys, I finally managed to do it thanks to your solution, Nilamo! In the meantime I got a different answer to this problem on a Google group where I also posted the question, and someone came up with the following solution:

if re.search(r"\d{2}\s?(?=(?:years old|yo|yr old|y o|yrs old|year old)(?!\s?son|\s?daughter|\s?kid|\s?child))" ,row[5].lower()):

explaining that the problem in my original code was that spaces were not being taken into account (or something like that, I'm not sure I fully understood the problem). So yes Ofnuts, according to this line of code it seems that it can actually be done with one single regex. In the end that code worked; the problem (which I only realized afterwards) is that it would also pick up the second age mentioned in the following example:

"I am 66 yrs old and have a 10 years old cat"

This would not necessarily be a problem, as these cases are fairly rare, but still, this is a potential bias I would have had to face.
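To make the difference between the two lookahead patterns concrete, here is a small comparison sketch (the names ORIGINAL and FIXED are just labels for this example). Two things go wrong in the original pattern: `|` has the lowest precedence, so the negative lookahead belongs only to the last alternative ("year old"), and the optional `\s?` in front of it can match nothing, in which case the lookahead sees " kid" rather than "kid" and succeeds.

```python
import re

# Pattern from the first post: (?!...) only guards the last alternative,
# and the \s? before it can backtrack to an empty match.
ORIGINAL = re.compile(
    r"\d{2}\s?(?=years old\s?|yo\s?|yr old\s?|y o\s?|yrs old\s?"
    r"|year old\s?(?!son|daughter|kid|child))"
)
# Pattern from the Google group answer: the alternatives are grouped and
# the optional space sits inside the negative lookahead.
FIXED = re.compile(
    r"\d{2}\s?(?=(?:years old|yo|yr old|y o|yrs old|year old)"
    r"(?!\s?son|\s?daughter|\s?kid|\s?child))"
)

samples = [
    "i have a 12 yo son",
    "john, father of a 12 year old kid, engineer",
    "john, 30 years old, engineer.",
]
for text in samples:
    print(text, bool(ORIGINAL.search(text)), bool(FIXED.search(text)))

# Both ages are still found in the cat example.
print(FIXED.findall("i am 66 yrs old and have a 10 years old cat"))
```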

After trying your solution Nilamo, I realized that the result (i.e. the new variable "matches") is actually a single list containing one entry per match, each entry itself holding the different elements that are present, or not, in the string. Thus, for the string "John, 30 years old, engineer", the match is matches[0], and the actual age is matches[0][0]. It is simple, but it took me some time to realize that. The final code is therefore:

if len(matches) > 0 and matches[0][2] == "":
   age = matches[0][0]
   print(age)


The > 0 part is there to avoid IndexErrors when no age is mentioned in a user's bio.

This also solves the above-mentioned problem of strings like "I am 66 yrs old and have a 10 years old cat", by simply not taking into account the second entry present in matches.
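Put together with the CSV loop from the first post, the whole thing might look like the sketch below. The in-memory sample rows are made up for the example, and the assumption that the bio sits in column 5 is taken from row[5] in the original program; the function name ages_from_rows is hypothetical.

```python
import csv
import io
import re

AGE_RE = re.compile(
    r"(\d{2})\s?(years old|yo|yr old|y o|yrs old|year old)\s*(son|daughter|kid|child)?"
)


def ages_from_rows(rows, bio_column=5):
    """Collect one age per row, using only the first match in each bio."""
    ages = []
    for row in rows:
        matches = AGE_RE.findall(row[bio_column].lower())
        # Keep the row only if something matched and no child word followed.
        if matches and not matches[0][2]:
            ages.append(matches[0][0])
    return ages


# Made-up sample data standing in for test_corpus.csv.
sample = io.StringIO(
    '1,a,b,c,d,"John, 30 years old, engineer."\n'
    '2,a,b,c,d,"John, father of a 12 year old kid, engineer"\n'
    '3,a,b,c,d,"I am 66 yrs old and have a 10 years old cat"\n'
    "4,a,b,c,d,no age here\n"
)
print(ages_from_rows(csv.reader(sample, delimiter=",")))
```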

I can now go on to the next part of this program, thanks a lot for your time and help guys, I truly appreciate it! : )
Reply
#5
(Feb-27-2017, 10:28 AM)MitchBuchanon Wrote: if len(matches) > 0 and matches[0][2] == "":
   age = matches[0][0]
   print(age)

I'm glad you got it working, but please, don't check if the length is positive.  Just check if it's a truthy value:
if matches and not matches[0][2]:
    #do things
It just looks much better (...and follows PEP 8, which is Python's style guide).
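As a quick illustration of why the shorter test is safe: an empty list is falsy, and `and` short-circuits, so matches[0] is never evaluated when there are no matches. The helper name has_user_age is only for this example.

```python
def has_user_age(matches):
    """Hypothetical helper: True if the first match has no child word."""
    # An empty list is falsy, so `and` stops before matches[0] is touched.
    return bool(matches and not matches[0][2])


print(has_user_age([]))
print(has_user_age([("30", "years old", "")]))
print(has_user_age([("12", "year old", "kid")]))
```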
Reply
#6
Quote:It just looks much better (...and follows PEP 8, which is Python's style guide).

Hmmm... I'm not really sure I understand what you mean by that, or how important it is, but I'll trust you on that, thanks for the advice! ; )
Reply
#7
When it comes to code, time is the most important resource. Something that minor, being easier to read, takes less time for you to think about what it does, so you can move on and add a new feature instead of trying to wrap your mind around how what you already have works.

But if nobody else will ever see your code, feel free to do whatever you want :P
For reference: https://www.python.org/dev/peps/pep-0008/
Reply
#8
Depends on what I do with the code, but as I'd like to share it for others to use it for the same purposes as mine, I may clean it up! ; )

Thanks for the details, and for the link, I'll check that! : )
Reply
#9
(Feb-28-2017, 07:23 PM)nilamo Wrote: But if nobody else will ever see your code, feel free to do whatever you want :P

Somebody else will see your code: yourself, six months later, with a hangover/flu/bad night. Don't code too cleverly.

Also, ponder this Brian Kernighan thought: "Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?".

Reply

