Python Forum
Parse using reg_ex
#1
Hi all,
I am new to Python and have run into a situation where I need to extract values from a string based on certain patterns.
The dataset has two columns:
Output:
Defect_id|event_description
D001|[DEFECT[: 40 D SCREEN INOP
D002|SEATS 04DE / 06DE / 05DE NO IFE / SEAT INOP (RECLINE)
D003|IFE INOP @ FOLLOWING SEATS IN SPITE OF RESET DONE IN FLIGHT : 10GH , 12EF , 16F , .., 34B , 33C , 32C , 30D , 27B , 18A , 17D , 16D , 14A , 12A.
Please find the desired output below.
Output:
Defect_id|affected_seats
D001|40D
D002|04D,04E,05D,05E,06D,06E
D003|10G,10H,12A,12E,12F,14A,16D,16F,17D,18A,27B,30D,32C,33C,34B
Below is my code.

import re
import pyspark.sql.functions as F
import pyspark.sql.types as T

from datasource.enrich.derivation import derives

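@derives("affected_seats")
def parse_affected_seats(defect_description):
    # Seat token: row number 1-99 followed directly by a single letter A-K.
    seats_pattern = re.compile(
        r'\b([1-9][0-9]?[A-K])\b'
    )

    def parse_seats(text):
        # Return the unique seat labels in sorted order, or None for empty input.
        return sorted(set(seats_pattern.findall(text))) if text else None

    # Wrap the parser as a Spark UDF that returns an array of strings.
    parse_seats_udf = F.udf(parse_seats, T.ArrayType(T.StringType()))
    return parse_seats_udf(defect_description)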
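
As a quick check of what the pattern above actually matches, the regex can be run outside Spark on the three sample descriptions. This is only a plain-Python sketch: the `samples` dict, the `results` dict, and the module-level `parse_seats` helper are scaffolding added for illustration and are not part of the pipeline.

```python
import re

# Same pattern as in the UDF: row 1-99 followed directly by one letter A-K.
original_pattern = re.compile(r'\b([1-9][0-9]?[A-K])\b')

# The three sample rows from the dataset (hypothetical in-memory copy).
samples = {
    "D001": "[DEFECT[: 40 D SCREEN INOP",
    "D002": "SEATS 04DE / 06DE / 05DE NO IFE / SEAT INOP (RECLINE)",
    "D003": ("IFE INOP @ FOLLOWING SEATS IN SPITE OF RESET DONE IN FLIGHT : "
             "10GH , 12EF , 16F , .., 34B , 33C , 32C , 30D , 27B , 18A , "
             "17D , 16D , 14A , 12A."),
}


def parse_seats(text):
    # Unique matches in sorted order, or None for empty input.
    return sorted(set(original_pattern.findall(text))) if text else None


results = {defect_id: parse_seats(text) for defect_id, text in samples.items()}

for defect_id, seats in results.items():
    print(defect_id, seats)
```

With these inputs, D001 and D002 come back empty and D003 is missing 10G, 10H, 12E and 12F. The pattern does not allow a space between the row and the letter ("40 D"), a leading zero ("04DE"), or more than one letter after the row number ("10GH").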
Any kind of help is highly appreciated.

Regards
Vinny
#2
First, it is unclear to me what the program is supposed to do. I don't understand how the correct output is determined. What is the goal here? Second, what problem are you having? Are you getting an error? If so, what is it? Is the output wrong? If so, how is it wrong?
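
One possible reading of the desired output, stated as an assumption rather than a confirmed rule: each seat token is a one- or two-digit row number (leading zero allowed), an optional single space, and one or more letters A-K, where every letter stands for a separate seat in that row. A minimal sketch of that reading is below. The names `SEAT_TOKEN` and `expand_seats` are hypothetical, and the code is plain Python with no Spark involved.

```python
import re

# Assumed token shape: row (1-2 digits), optional space, one or more letters A-K.
SEAT_TOKEN = re.compile(r'\b(\d{1,2})\s?([A-K]+)\b')


def expand_seats(text):
    """Expand tokens such as '04DE' into ['04D', '04E']; None for empty input."""
    if not text:
        return None
    seats = set()
    for row, letters in SEAT_TOKEN.findall(text):
        # Each letter after the row number is treated as its own seat.
        for letter in letters:
            seats.add(row + letter)
    # Lexicographic sort; matches the sample output because rows are two digits.
    return sorted(seats)


print(expand_seats("[DEFECT[: 40 D SCREEN INOP"))
print(expand_seats("SEATS 04DE / 06DE / 05DE NO IFE / SEAT INOP (RECLINE)"))
```

Under those assumptions the function could be passed to `F.udf` in place of `parse_seats`. Two caveats: allowing the optional space can produce false positives (for example, "ROW 4 BAD" would yield 4A, 4B, 4D), and because the sort is lexicographic, a row written without a leading zero, such as 9A, would sort after 10A.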
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures