Jul-16-2019, 02:41 PM
Hi All
I am new to Python and have run into a situation where I need to extract values from a string based on certain patterns.
The dataset has two columns:
Defect_id|event_description
D001|[DEFECT[: 40 D SCREEN INOP
D002|SEATS 04DE / 06DE / 05DE NO IFE / SEAT INOP (RECLINE)
D003|IFE INOP @ FOLLOWING SEATS IN SPITE OF RESET DONE IN FLIGHT : 10GH , 12EF , 16F , .., 34B , 33C , 32C , 30D , 27B , 18A , 17D , 16D , 14A , 12A.
Please find the desired output below.
Defect_id|affected_seats
D001|40D
D002|04D,04E,05D,05E,06D,06E
D003|10G,10H,12A,12E,12F,14A,16D,16F,17D,18A,27B,30D,32C,33C,34B
Below is my code.

    import re
    import pyspark.sql.functions as F
    import pyspark.sql.types as T
    from datasource.enrich.derivation import derives

    @derives("affected_seats")
    def parse_affected_seats(defect_description):
        seats_pattern = re.compile(r'\b([1-9][0-9]?[A-K])\b')

        def parse_seats(text):
            return sorted(list(set(seats_pattern.findall(text)))) if text else None

        parse_seats_udf = F.udf(parse_seats, T.ArrayType(T.StringType()))
        return parse_seats_udf(defect_description)

Any kind of help is highly appreciated.
Regards
Vinny
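
One possible direction, offered as a minimal sketch rather than a definitive fix: the pattern in the post accepts only a row number immediately followed by a single letter and rejects leading zeros, so it cannot match "04DE", "10GH" or "40 D". The plain-Python version below assumes that a seat group is one or two digits, optional whitespace, then one or more letters from A to K, and it expands each group into one seat per letter. The names SEAT_GROUP and parse_seats here are illustrative, and the pattern is only checked against the three sample rows; free text such as "40 DECK" would be misread as seats, so it may need tightening for real data.

```python
import re

# Assumed seat-group shape: 1-2 digit row, optional spaces, 1+ letters A-K.
# This is a guess based on the three sample rows, not a verified rule.
SEAT_GROUP = re.compile(r'\b(\d{1,2})\s*([A-K]+)\b')


def parse_seats(text):
    """Return a sorted list of unique seats such as '04D', or None for empty input."""
    if not text:
        return None
    seats = set()
    for row, letters in SEAT_GROUP.findall(text):
        # "04DE" becomes "04D" and "04E"; the row keeps its leading zero.
        for letter in letters:
            seats.add(row + letter)
    return sorted(seats)


if __name__ == "__main__":
    samples = {
        "D001": "[DEFECT[: 40 D SCREEN INOP",
        "D002": "SEATS 04DE / 06DE / 05DE NO IFE / SEAT INOP (RECLINE)",
    }
    for defect_id, description in samples.items():
        print(defect_id + "|" + ",".join(parse_seats(description)))
```

The function has no Spark dependency, so it can be tested on its own and then wrapped with F.udf(parse_seats, T.ArrayType(T.StringType())) in the same way as the code in the post.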