Python Forum

Full Version: Using re to find only uppercase letters
Hi,
I'm trying to solve a problem using the re module, and one of the requirements is to match a string that contains the letters ATGC in uppercase only.
This is my code:
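import re

def isVCF(file):
    # Columns: chromosome, position, one free-form column, then two single-base columns.
    num_format = re.compile(r"^chr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*(?:\t[ATCG]){2}\t")
    # Read-only mode is enough here; the file is never written to.
    with open(file, "r") as my_file:
        for line in my_file:
            if not num_format.match(line):
                return False
        return True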
And this is an example of a line:
Output:
ChrX 74226540 T t 50 .
The problem is that it also matches the lowercase "t", and I only want it to match uppercase letters.
I've tried several things, but none of them worked.
I appreciate any kind of help!
What is the exact meaning of 'find a string with the letters ATGC only in uppercase'? Does it mean 'determine whether a line contains a word constructed only from the letters ATGC in any combination'? Or the same applied to the whole file? And finally, do you have to use re? Line #2 in your code reminds me of an old programming joke: the plural of regex is regrets.
Is there something that isn't shown? This shouldn't have matched: ChrX (your regex only looks for lowercase "chr")
I'm assuming that's the same reason the lowercase "t" was matched.
(May-27-2021, 06:53 PM)perfringo Wrote: What is the exact meaning of 'find a string with the letters ATGC only in uppercase'? Does it mean 'determine whether a line contains a word constructed only from the letters ATGC in any combination'? Or the same applied to the whole file? And finally, do you have to use re? Line #2 in your code reminds me of an old programming joke: the plural of regex is regrets.

It means that exactly one of the letters A, T, G, C appears in each of columns 4 and 5.
And yes, sadly, I have to use regex.
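
A minimal sketch of what that requirement could look like as a pattern, assuming the tab-separated column order is chromosome, position, a free-form third column, and then the two single-letter columns. The name VCF_LINE and both sample lines are made up for illustration and are not taken from the original file.

```python
import re

# Assumed layout (tab-separated): chromosome, position, free-form column,
# then columns 4 and 5, each holding exactly one of A, T, G, C.
VCF_LINE = re.compile(
    r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])"  # chromosome, e.g. chr1, Chr07, chrX
    r"\t0*[1-9][0-9]*"                      # position: a positive integer
    r"\t[^\t]*"                             # column 3: anything without a tab
    r"\t[ATGC]\t[ATGC]"                     # columns 4 and 5: one uppercase base each
    r"(?:\t|$)"                             # column 5 must end right here
)

# Hypothetical sample lines, written with explicit \t escapes.
good = "ChrX\t74226540\t.\tT\tA\t50\t."
bad = "ChrX\t74226540\t.\tT\tt\t50\t."

print(bool(VCF_LINE.match(good)))  # True
print(bool(VCF_LINE.match(bad)))   # False
```

The trailing (?:\t|$) is a design choice: without it, a value such as "AT" in column 5 would still pass, because only its first letter would be checked.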
(May-27-2021, 09:21 PM)nilamo Wrote: Is there something that isn't shown? This shouldn't have matched: ChrX (your regex only looks for lowercase "chr")
I'm assuming that's the same reason the lowercase "t" was matched.

You are right, it's actually like this:
r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*\t[ATGC]{2}"
Something still seems off, as that regex won't match the string.
>>> import re
>>> test = 'ChrX\t74226540\tT\tt\t50\t.'
>>> test
'ChrX\t74226540\tT\tt\t50\t.'
>>> print(test)
ChrX    74226540        T       t       50      .
>>> raw_regex = r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*\t[ATGC]{2}"
>>> regex = re.compile(raw_regex)
>>> regex.match(test)
>>> regex
re.compile('^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\\t0*[1-9][0-9]*\\t[^\\t]*\\t[ATGC]{2}')
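
One likely reason that pattern finds nothing, sketched below: [ATGC]{2} asks for two letters directly next to each other, with no tab in between, whereas (?:\t[ATGC]){2} asks for a tab plus a letter, twice. As far as I know, re is case-sensitive unless re.IGNORECASE is passed, and a quantifier such as {2} does not change that. The variable names and the short test strings are made up for illustration.

```python
import re

two_adjacent = re.compile(r"\t[ATGC]{2}")       # tab, then two letters side by side
tab_separated = re.compile(r"(?:\t[ATGC]){2}")  # (tab + one letter), two times

print(bool(two_adjacent.search("\tTA")))     # True
print(bool(two_adjacent.search("\tT\tA")))   # False
print(bool(tab_separated.search("\tT\tA")))  # True
print(bool(tab_separated.search("\tT\tt")))  # False

# Neither pattern was compiled with the IGNORECASE flag.
print(bool(two_adjacent.flags & re.IGNORECASE))  # False
```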
(May-28-2021, 06:58 PM)nilamo Wrote: Something still seems off, as that regex won't match the string.

I kind of figured it out. For some reason, when I use {2} it seems to behave case-insensitively, so I just separated it and wrote the character class twice:
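import re

def isVCF(file):
    num_format = re.compile(r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*\t[ATGC]\t[ATGC]")
    with open(file, "r") as my_file:
        for line in my_file:
            # Skip header lines, which start with "#".
            if line.startswith("#"):
                continue
            # Reject the file as soon as one data line fails to match.
            if not num_format.match(line):
                return False
    # Every data line matched.
    return True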
I used line.startswith("#") to skip the header lines.
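
A minimal sketch of how such a check could be tried out on small inputs, assuming the goal is that every non-header line has to match. The names LINE_FORMAT, is_vcf and check, and the sample lines, are hypothetical and only serve the illustration.

```python
import os
import re
import tempfile

LINE_FORMAT = re.compile(
    r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*\t[ATGC]\t[ATGC]"
)

def is_vcf(path):
    """Return True only if every line that is not a '#' header matches."""
    with open(path) as handle:
        return all(
            LINE_FORMAT.match(line)
            for line in handle
            if not line.startswith("#")
        )

def check(text):
    # Write the text to a temporary file, run is_vcf on it, then clean up.
    with tempfile.NamedTemporaryFile("w", suffix=".vcf", delete=False) as tmp:
        tmp.write(text)
    try:
        return is_vcf(tmp.name)
    finally:
        os.remove(tmp.name)

header = "#CHROM\tPOS\tID\tREF\tALT\n"
print(check(header + "chr1\t100\t.\tA\tG\n"))                      # True
print(check(header + "chr1\t100\t.\tA\tG\nChrX\t200\t.\tT\tt\n"))  # False
```

The second call is the interesting one: the first data line is valid and only the second contains a lowercase "t", so a check that stops after the first data line would not catch it.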