Python Forum
Using regex for type validation
#1
Hi everyone,
I'm starting to learn regex and I'm trying to solve a problem with it.
I need to write a function that takes a file and checks whether certain columns meet the following requirements:

The first column should start with "chr" and end with either a number from 1 to 99 or one of the letters "M", "X", "Y".
The second column must be an integer greater than 0.
The 4th and 5th columns must each be exactly one of the letters "A", "T", "C", "G".
If any of these checks fails, even in a single row, the function should return False.

Here's the code I wrote:
def isVCF(file):
        with open(file, "r+") as my_file:
            lines = my_file.readlines()
            for line in lines:
                columns = line.split("\t")
            if (re.match(r"^chr(?:[1-9][0-9]?|[XYM])$", columns[0]) 
                and re.match(r"^[1-9][0-9]*$", columns[1])
                and re.match(r"^[ATGC]$", columns[4]) 
                and re.match(r"^[ATGC]$", columns[5])): 
                return True
            else:
                return False
There are several problems here that I don't know how to solve:

When I print columns[0], I only get the value from the first row and not from all of the rows (which I thought the first part of the code took care of).

The regex part: I ran it on a file that should return True, but I keep getting False.

Any help is appreciated!

Example of one line in the file:
Output:
['chrX', '74226650', '.', 'T', 'C', '50', '.', 'DP=385;VDB=0;SGB=-0.693147;RPB=0.982669;MQB=1;BQB=0.947576;MQ0F=0;AC=2;AN=2;DP4=0,95,0,289;MQ=20', 'GT:PL:DP', '1/1:78,127,0:384']
#2
The 4th and 5th columns have indices 3 and 4, because Python indexing starts at 0.
#3
(May-21-2021, 05:50 PM)Gribouillis Wrote: The 4th and 5th columns have indices 3 and 4, because Python indexing starts at 0.

Thank you, totally my bad.
Now I do get True,
but when I check a file that should return False, I keep getting True.
#4
You would get better-structured code with a function that handles each line:
import re

def is_vcf_line(line):
    # Split the row into its tab-separated columns.
    columns = line.split('\t')
    # Columns 1, 2, 4 and 5 have indices 0, 1, 3 and 4.
    return bool(
        re.match(r"^chr(?:[1-9][0-9]?|[XYM])$", columns[0])
        and re.match(r"^[1-9][0-9]*$", columns[1])
        and re.match(r"^[ATGC]$", columns[3])
        and re.match(r"^[ATGC]$", columns[4]))

def is_vcf(filename):
    with open(filename) as lines:
        return all(is_vcf_line(line) for line in lines)
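
To see what the pattern on the second column accepts, a quick check like the one below may help. It is only a sketch: the helper name `is_positive_int` and the sample strings are made up for illustration and are not part of the code above. It also shows one detail of `re.match` with `$`: a single trailing newline is still accepted, whereas `re.fullmatch` is stricter.

```python
import re

# Same pattern as the one used for the second column above.
POSITION = re.compile(r"^[1-9][0-9]*$")

def is_positive_int(text):
    """Hypothetical helper: True if text looks like an integer greater than 0."""
    return bool(POSITION.match(text))

# Illustrative inputs only; print what the pattern accepts and rejects.
for candidate in ["1", "74226650", "0", "-5", "1.5", "007", ""]:
    print(repr(candidate), is_positive_int(candidate))

# "$" also matches just before a trailing newline, so this prints True ...
print(is_positive_int("12\n"))
# ... while fullmatch requires the whole string to match, so this prints False.
print(bool(re.fullmatch(r"[1-9][0-9]*", "12\n")))
```

So the pattern only lets through strings made of digits that do not start with 0, which covers the "integer greater than 0" requirement without converting the value with `int()`.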
#5
(May-22-2021, 07:35 AM)Gribouillis Wrote: You would get better-structured code with a function that handles each line:
def is_vcf_line(line):
    columns = line.split('\t')
    return bool(
        re.match(r"^chr(?:[1-9][0-9]?|[XYM])$", columns[0]) 
        and re.match(r"^[1-9][0-9]*$", columns[1])
        and re.match(r"^[ATGC]$", columns[3]) 
        and re.match(r"^[ATGC]$", columns[4]))

def is_vcf(filename):
    with open(filename) as lines:
        return all(is_vcf_line(line) for line in lines)

Gotcha, but when I run the tests I still get False even though I should get True.
BTW, maybe I didn't mention it, but the second column needs to be an int as well. Does the code check that too?
#6
If you are getting False, it means that at least one of the lines doesn't match the regexes. Try to find out which line it is.
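
One possible way to do that is sketched below. The names `COLUMN_PATTERNS` and `find_bad_lines` are hypothetical, and the patterns simply restate the rules from the first post. The function reports every field that fails, along with its line number, instead of returning a single True or False. If the file is a standard VCF, it probably begins with header lines starting with "#", and those would show up here as failures on line 1 onward.

```python
import re

# Column index -> pattern, following the rules in the first post (assumed layout).
COLUMN_PATTERNS = {
    0: re.compile(r"chr(?:[1-9][0-9]?|[XYM])"),
    1: re.compile(r"[1-9][0-9]*"),
    3: re.compile(r"[ATGC]"),
    4: re.compile(r"[ATGC]"),
}

def find_bad_lines(lines):
    """Return (line_number, column_index, value) for every field that fails.

    value is None when the row has too few columns.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        columns = line.rstrip("\n").split("\t")
        for index, pattern in COLUMN_PATTERNS.items():
            value = columns[index] if index < len(columns) else None
            # fullmatch requires the whole field to match the pattern.
            if value is None or not pattern.fullmatch(value):
                problems.append((number, index, value))
    return problems
```

Since an open file can be iterated line by line, something like `with open(filename) as f: print(find_bad_lines(f))` should list the offending rows; an empty list means every line passed.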