Using re to find only uppercase letters

Using re to find only uppercase letters - ranbarr - May-27-2021

Hi,
I'm trying to solve a problem using the re module, and one of the requirements is to match a string containing the letters ATGC only in uppercase.
This is my code:

def isVCF(file):
    num_format = re.compile(r"^chr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*(?:\t[ATCG]){2}\t")
    with open(file, "r+") as my_file:
        for line in my_file:
            if not num_format.match(line):
                return False
        return True

And this is an example of a line:

Output:
ChrX	74226540		T	t	50	.

The problem is that it's matching the lowercase "t" as well, and I only want it to match uppercase letters.
I've tried several things but none worked.
I'd appreciate any kind of help!

RE: Using re to find only uppercase letters - perfringo - May-27-2021

What is the exact meaning of 'find a string with the letters ATGC only in uppercase'? Does it mean 'determine whether a line contains a word constructed only from the letters ATGC in any combination'? Or the same applied to the whole file? And finally, do you have to use re? Line #2 in your code reminds me of an old programming joke: the plural of regex is regrets.

RE: Using re to find only uppercase letters - nilamo - May-27-2021

Is there something that isn't shown? This shouldn't have matched: ChrX (your regex only looks for lowercase "chr")
I'm assuming that's the same reason the lowercase "t" was matched.

RE: Using re to find only uppercase letters - ranbarr - May-28-2021

(May-27-2021, 06:53 PM)perfringo Wrote: What is the exact meaning of 'find a string with the letters ATGC only in uppercase'? Does it mean 'determine whether a line contains a word constructed only from the letters ATGC in any combination'? Or the same applied to the whole file? And finally, do you have to use re? Line #2 in your code reminds me of an old programming joke: the plural of regex is regrets.

It means that exactly one of the letters A, T, G, C (just one letter) appears in each of columns 4 and 5.
And yes, sadly I have to use regex.
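
For readers following along, the requirement as stated here (one uppercase letter from ATGC in column 4 and one in column 5) can be sketched in isolation. This is only an illustration, not the poster's code: the names COLS_4_5 and has_single_base_columns are hypothetical, and it assumes columns are separated by single tab characters.

```python
import re

# Hypothetical pattern: skip the first three tab-separated columns, then
# require exactly one uppercase A, T, G or C in column 4 and in column 5.
# [^\t]* cannot cross a tab, so each repetition consumes exactly one column.
COLS_4_5 = re.compile(r"^(?:[^\t]*\t){3}[ATGC]\t[ATGC](?:\t|$)")


def has_single_base_columns(line):
    """Return True if columns 4 and 5 each hold one uppercase base."""
    return COLS_4_5.match(line.rstrip("\n")) is not None


if __name__ == "__main__":
    print(has_single_base_columns("ChrX\t74226540\t\tT\tG\t50\t.\n"))
    print(has_single_base_columns("ChrX\t74226540\t\tT\tt\t50\t.\n"))
```

Character classes such as [ATGC] are case-sensitive by default in Python's re module, so the second call rejects the lowercase "t" without any extra flag.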

RE: Using re to find only uppercase letters - ranbarr - May-28-2021

(May-27-2021, 09:21 PM)nilamo Wrote: Is there something that isn't shown? This shouldn't have matched: ChrX (your regex only looks for lowercase "chr")
I'm assuming that's the same reason the lowercase "t" was matched.

You are right, it's actually like this:

(r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*\t[ATGC]{2}

RE: Using re to find only uppercase letters - nilamo - May-28-2021

Something still seems off, as that regex won't match the string.

>>> import re
>>> test = 'ChrX\t74226540\tT\tt\t50\t.'
>>> test
'ChrX\t74226540\tT\tt\t50\t.'
>>> print(test)
ChrX    74226540        T       t       50      .
>>> raw_regex = r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*\t[ATGC]{2}"
>>> regex = re.compile(raw_regex)
>>> regex.match(test)
>>> regex
re.compile('^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\\t0*[1-9][0-9]*\\t[^\\t]*\\t[ATGC]{2}')

RE: Using re to find only uppercase letters - ranbarr - May-31-2021

(May-28-2021, 06:58 PM)nilamo Wrote: Something still seems off, as that regex won't match the string.


I kinda figured it out... For some reason, when I use {2} it seems to be case-insensitive, so I just separated it to match the letter twice:

import re

def isVCF(file):
    num_format = re.compile(r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*\t[ATGC]\t[ATGC]")
    with open(file, "r") as my_file:
        for line in my_file:
            # Skip header lines
            if line.startswith("#"):
                continue
            # Reject the file as soon as one data line fails to match
            if not num_format.match(line):
                return False
    # Every data line matched
    return True

I used the line.startswith("#") check to skip the header lines.
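
A note on the {2} observation: in Python's re module a quantifier such as {2} does not change case sensitivity; only the re.IGNORECASE flag (or an inline (?i)) does. The more likely difference is that [ATGC]{2} asks for two adjacent letters in one column, while \t[ATGC]\t[ATGC] asks for one letter in each of two tab-separated columns. The sketch below compares the variants from this thread. The variable names are hypothetical, and the sample lines assume an empty third column, as in the example line from the first post.

```python
import re

# Shared prefix taken from the patterns posted in this thread.
prefix = r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*"

adjacent = re.compile(prefix + r"\t[ATGC]{2}")       # two letters in ONE column
grouped = re.compile(prefix + r"(?:\t[ATGC]){2}\t")  # one letter in each of two columns
separate = re.compile(prefix + r"\t[ATGC]\t[ATGC]")  # written out twice
ignore_case = re.compile(prefix + r"\t[ATGC]\t[ATGC]", re.IGNORECASE)

lower = "ChrX\t74226540\t\tT\tt\t50\t."  # lowercase "t" in column 5
upper = "ChrX\t74226540\t\tT\tG\t50\t."  # uppercase letters in columns 4 and 5

for name, pattern in [("adjacent", adjacent), ("grouped", grouped),
                      ("separate", separate), ("ignore_case", ignore_case)]:
    # bool(match) is True when the line matches from its start
    print(name, bool(pattern.match(upper)), bool(pattern.match(lower)))
```

Only the ignore_case variant accepts the lowercase "t". The grouped and separate variants both reject it and accept the uppercase line, and the adjacent variant rejects both lines because no column holds two letters side by side. One caveat: the separate pattern has no trailing \t, so it would also accept a fifth column that merely starts with a valid letter, such as "GA".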