Python Forum
Cleaning a dataset: How to extract text between two patterns
The datasets I collected for a research project are a bit of a mess right now. While scraping questions from GitHub I ran into a problem I had more or less foreseen, and I now have to tackle it before I can proceed (otherwise this research will remain an unorganized mess).

To summarize, all the scraped data consists of two parts that currently sit in the same column. First there is a question worth a certain amount of XP: 0xp, 50xp, or 100xp. After it comes an instruction that supports the question with a short explanation of what needs to be done. In this research, the two paragraphs together make one question (one standard problem = question + instruction). The snapshot linked here shows what a GitHub page containing one standard problem looks like: [snapshot GitHub page](https://prnt.sc/gRzbxCDajBqA)

As the second screenshot shows (snapshot dataset/CSV file: https://prnt.sc/um_5O4eWa3CX), the question and instruction sit in the same column, and this repeats until roughly row 1,500 (so about 750 standard problems in total). To get a clean, workable dataset for this research, the questions need to be separated from the instructions while staying in the same row. Each row will then consist of ‘question’ and ‘instruction’, making these two variables one case out of the roughly 750. [snapshot dataset](https://i.stack.imgur.com/9U39b.png) The dataset is not as clean as it looks right now, because the instructions are placed irregularly within the column and can span multiple rows instead of just one.

The issue is that it is hard to get each instruction next to its question with something like pandas, or with Ctrl+F in the CSV file. Because of the size of the dataset, and for the fun of it, I want to solve this programmatically. I was thinking about the following:


1. Use a regular expression.

2. The regular expression searches for question/instruction pairs within the dataset.

3. The instruction contains elements from the question, which the expression searches for in the CSV file (without predefined delimiters).

4. If there is a match, as in the example above (snapshot dataset: head and tail, .shape, .columns), the instruction is placed in the same row as its question, but in a new column (the instruction column).

5. The regular expression keeps matching the text below until it finds a new question starting with ‘xp’ (so a new standard problem has been located).

6. This is repeated until every question has its instruction as a neighbor.
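
Steps 1 to 6 could be sketched roughly as follows. This is only a minimal sketch under assumptions: that each question sits on a single line and contains an XP token such as `0xp`, `50xp`, or `100xp`, and that everything up to the next such line is its instruction. The names `XP_LINE` and `split_problems` are made up for illustration, and the sample text is invented, not taken from the dataset.

```python
import re

# Assumption: a question line contains an XP token like "0xp", "50xp" or "100xp".
# Adjust this pattern to whatever actually marks a question in the real data.
XP_LINE = re.compile(r"^.*\b\d+\s*xp\b.*$", re.IGNORECASE | re.MULTILINE)


def split_problems(text):
    """Return a list of (question, instruction) pairs from raw scraped text."""
    matches = list(XP_LINE.finditer(text))
    pairs = []
    for i, match in enumerate(matches):
        # The instruction runs from the end of this question line
        # to the start of the next question line (or the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        question = match.group(0).strip()
        # Collapse a multi-row instruction into a single line of text.
        instruction = " ".join(text[match.end():end].split())
        pairs.append((question, instruction))
    return pairs


# Invented sample, only to show the shape of the result.
sample = (
    "Head and tail 100xp\n"
    "Inspect the first rows.\n"
    "Then inspect the last rows.\n"
    "Shape 50xp\n"
    "Print the shape.\n"
)
pairs = split_problems(sample)
```

One caveat with this sketch: an instruction line that itself mentions something like "50 xp" would be mistaken for a new question, so the pattern probably needs tightening against the real data.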

Is this a proper solution, or do I need to change direction? If so, which Python libraries or methods could be used? Or should I skip Python and use some other technique? Is it possible to train a regular expression to recognize a row, match it with the right instruction, and move on to the next one?

I tried using Ctrl+F to select all the xp questions and then delete them (after moving them to a new file with their own column). The problem is that not every instruction is exactly one row (it can span multiple rows, all containing information about the question). Also, I am not sure about a regular expression, since I have not predefined what it should search for. What would be the best step to take?

I was wondering if the following code construct could do the work:

read the input file
open an output file
initialize an output line
loop over the lines in the input file
    if the current input line starts with #xp
        write out the output line
        begin a new output line with the current input line
    else
        add the current input line to the existing output line with an appropriate separator (a comma if the first, a newline otherwise)
write out the final output line
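
In Python, the loop above might look something like the sketch below. It rests on assumptions: a question line is taken to be any line containing an XP token such as `50xp` (the pseudocode's `#xp` is read that way here), blank lines are skipped, and the `csv` module handles quoting so that a multi-row instruction stays in one cell. The names `IS_QUESTION`, `group_rows`, and `write_csv`, and the file names in the comment, are hypothetical.

```python
import csv
import re

# Assumption: question lines carry an XP token like "0xp", "50xp" or "100xp".
IS_QUESTION = re.compile(r"\b\d+\s*xp\b", re.IGNORECASE)


def group_rows(lines):
    """Yield [question, instruction] rows from an iterable of text lines."""
    question, instruction = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty rows
        if IS_QUESTION.search(line):
            # A new question starts: emit the previous pair first.
            if question is not None:
                yield [question, "\n".join(instruction)]
            question, instruction = line, []
        elif question is not None:
            # Any other line belongs to the current question's instruction.
            instruction.append(line)
    # Emit the final pair.
    if question is not None:
        yield [question, "\n".join(instruction)]


def write_csv(lines, out):
    """Write the grouped rows to an open text file as two-column CSV."""
    writer = csv.writer(out)
    writer.writerow(["question", "instruction"])
    writer.writerows(group_rows(lines))


# Possible usage (hypothetical file names):
# with open("scraped.csv", encoding="utf-8") as src, \
#         open("clean.csv", "w", newline="", encoding="utf-8") as dst:
#     write_csv(src, dst)
```

Lines that appear before the first question are dropped in this sketch, which may or may not be what the data needs.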