Python Forum

Parse using reg_ex
Hi all,
I am new to Python and have come across a situation where I need to extract values from a string based on certain patterns.
The dataset has two columns:
Output:
Defect_id|event_description
D001|[DEFECT[: 40 D SCREEN INOP
D002|SEATS 04DE / 06DE / 05DE NO IFE / SEAT INOP (RECLINE)
D003|IFE INOP @ FOLLOWING SEATS IN SPITE OF RESET DONE IN FLIGHT : 10GH , 12EF , 16F , .., 34B , 33C , 32C , 30D , 27B , 18A , 17D , 16D , 14A , 12A.
Please find the desired output below.
Output:
Defect_id|affected_seats
D001|40D
D002|04D,04E,05D,05E,06D,06E
D003|10G,10H,12A,12E,12F,14A,16D,16F,17D,18A,27B,30D,32C,33C,34B
Below is my code.

import re
import pyspark.sql.functions as F
import pyspark.sql.types as T

from datasource.enrich.derivation import derives

@derives("affected_seats")
def parse_affected_seats(defect_description):
    seats_pattern = re.compile(
        r'\b([1-9][0-9]?[A-K])\b'
    )

    def parse_seats(text):
        return sorted(list(set(seats_pattern.findall(text)))) if text else None

    parse_seats_udf = F.udf(parse_seats, T.ArrayType(T.StringType()))
    return parse_seats_udf(defect_description)
Any kind of help is highly appreciated.

Regards
Vinny
First, it is unclear to me what the program is doing. I don't understand how the correct output is determined. What is the goal here? Second, what problem are you having? Are you getting an error? If so, what is it? Is the output wrong? If so, how is it wrong?
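
Going only by the sample rows in the question, one possible reading of the goal is that each seat token is a row number followed by one or more seat letters, so `04DE` stands for `04D` and `04E`, and `40 D` stands for `40D`. If that is right, the posted pattern `[1-9][0-9]?[A-K]` cannot match those tokens: it rejects a leading zero (`04`), it allows exactly one letter before the closing `\b` (so `04DE` and `10GH` fail), and it allows no space between row and letter (`40 D`). Below is a minimal sketch under that assumption, in plain Python without the Spark UDF wrapper. The names `SEAT_GROUP` and `expand_seats` are made up for illustration and are not part of the original code.

```python
import re

# Assumed token shape (inferred from the sample rows only):
# a 1-2 digit row, an optional single space, then one or more seat letters A-K.
SEAT_GROUP = re.compile(r'\b(\d{1,2})\s?([A-K]+)\b')


def expand_seats(text):
    """Return sorted, de-duplicated seat codes found in text, or None if text is empty."""
    if not text:
        return None
    seats = set()
    for row, letters in SEAT_GROUP.findall(text):
        # "04DE" becomes "04D" and "04E"; the row is kept exactly as written.
        for letter in letters:
            seats.add(row + letter)
    # Sort by numeric row, then seat letter, then the raw string as a tie-breaker.
    return sorted(seats, key=lambda s: (int(s[:-1]), s[-1], s))


print(",".join(expand_seats("SEATS 04DE / 06DE / 05DE NO IFE / SEAT INOP (RECLINE)")))
```

If this matches what you want, the same function could be passed to `F.udf` in place of `parse_seats`. Note that the pattern is deliberately loose: any short number followed by a word made up only of the letters A to K (for example `2 BAD`) would also be picked up as seats, so it may need tightening against real data.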