Python Forum
Removing timestamps from transcriptions
#1
I have several transcription files with timestamps embedded in the text. They look like this:

Quote:So from 12:23 that very moment you actually 12:25 actually continued the form of a choice, 12:28 that you kept going to the point of 12:30 no return, where you actually became who 12:33 you are now. 12:35 So you actually took part in 12:38 who you are now.

and we need them to look like this:

Quote:So from that very moment you actually actually continued the form of a choice, that you kept going to the point of no return, where you actually became who you are now. So you actually took part in who you are now.

Is it simply a matter of search and replace? For example, search for every digit or colon and replace it with nothing? I want to use Python to process a number of files; there are about 20 of these .txt files and some are 130K, so doing it manually is out of the question.

Possibly search for a space followed by a digit to mark where the replacement starts. More often than not we have:

Quote:access 11:30 that?

which should become:

Quote:access that?

So there is usually a preceding space that needs to be removed as well.
Reply
#2
Hmm, I had a bit of a look. The replacing part is easy, but it seems the searching needs to be done with a regex or str.isdigit (https://docs.python.org/3/library/stdtyp...tr.isdigit)?

This seems to work for the regex side of things:

import re

timestamp_regex = r'\d{2}:\d{2}'

# re.match only matches at the very start of the string
print(bool(re.match(timestamp_regex, ' 12:23')))  # False (leading space)
print(bool(re.match(timestamp_regex, '12:23 ')))  # True
print(bool(re.match(timestamp_regex, '12:23')))   # True
Reply
#3
(Dec-05-2018, 03:17 AM)jehoshua Wrote: Is it simply a matter of search and replace? For example, search for every digit or colon and replace it with nothing? I want to use Python to process a number of files; there are about 20 of these .txt files and some are 130K, so doing it manually is out of the question.

Possibly search for a space followed by a digit to mark where the replacement starts.

You could do it with a regex and then just replace all double spaces with a single space. Or you could write a function that finds every colon and then removes the 3 characters before it and the 2 characters after it.
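A minimal sketch of the first suggestion (regex, then collapse the leftover double spaces). The function name strip_timestamps is just made up for illustration, and it assumes every timestamp is exactly two digits, a colon, and two digits:

```python
import re

# Assumption: timestamps are always two digits, a colon, two digits.
TIMESTAMP = re.compile(r'\d{2}:\d{2}')


def strip_timestamps(text):
    # Step 1: remove the timestamps; the spaces on both sides stay behind.
    without_stamps = TIMESTAMP.sub('', text)
    # Step 2: collapse every run of two or more spaces into one space.
    return re.sub(r' {2,}', ' ', without_stamps)


print(strip_timestamps('So from 12:23 that very moment you actually 12:25 actually continued'))
```

Note that step 2 also collapses double spaces that were already in the transcript, which may or may not be wanted.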
Reply
#4
import re


timestamp = re.compile(r'\d{2}:\d{2} ') # <- the white space is a part of the timestamp
text = '''So from 12:23 that very moment you actually 12:25 actually continued the form of a choice, 12:28 that you kept going to the point of 12:30 no return, where you actually became who 12:33 you are now. 12:35 So you actually took part in 12:38 who you are now.'''

filtered_text = timestamp.sub('', text)
print(filtered_text)
Problem: 99:99 is also a valid match.

A better pattern:
timestamp = r'[012][0123456789]:[012345][0123456789] '
timestamp = r'[0-2]\d:[0-5]\d ' # short form
But this also allows values in timestamps like 25:00, which is an invalid time.

You could check each timestamp and remove it only if it is valid.
The question is, do you need that?
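If you do need that check, one way to sketch it is to pass a function to re.sub instead of a replacement string. The names below (CANDIDATE, drop_if_valid) are made up for this example, and the rule itself is an assumption: it treats the value as a 24-hour hh:mm clock time, which may not be what the data actually contains.

```python
import re

# Capture both number groups so they can be checked; the trailing space
# is treated as part of the timestamp.
CANDIDATE = re.compile(r'(\d{2}):(\d{2}) ')


def drop_if_valid(match):
    first, second = int(match.group(1)), int(match.group(2))
    # Assumed rule: a 24-hour clock time, hours 00-23 and minutes 00-59.
    if first <= 23 and second <= 59:
        return ''              # valid: remove it
    return match.group(0)      # invalid: leave the text untouched


text = 'a 12:23 b 25:00 c 99:99 d'
print(CANDIDATE.sub(drop_if_valid, text))  # a b 25:00 c 99:99 d
```

Changing the two limits in drop_if_valid is enough to adapt the check to another format.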
Reply
#5
(Dec-05-2018, 02:49 PM)DeaD_EyE Wrote: But this also allows values in timestamps like 25:00, which is an invalid time.
I guess the timestamps are min:sec from the start, not a time of day like hh:mm, but it's up to the OP to confirm that. More broadly, it raises the question of what values are possible, e.g. whether mmm:ss from the start can occur.
Reply
#6
(Dec-05-2018, 12:32 PM)metulburr Wrote: You could do it with a regex and then just replace all double spaces with a single space. Or you could write a function that finds every colon and then removes the 3 characters before it and the 2 characters after it.

It's possible that there are double spaces or colons in the transcript that are not part of a timestamp though, so it might be a bit risky?
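One possible way to reduce that risk, as a rough sketch: only match a timestamp that stands alone as a whitespace-delimited token, and remove the one space after it in the same step, so no separate double-space cleanup is needed. The function name is made up, and the pattern assumes timestamps are never glued to punctuation (a stamp written as "12:30." would be left in place).

```python
import re

# (?<!\S) : nothing but whitespace (or start of text) before the stamp
# (?!\S)  : nothing but whitespace (or end of text) after the stamp
# ' ?'    : also consume the single space that follows, if there is one
STANDALONE = re.compile(r'(?<!\S)\d{2}:\d{2}(?!\S) ?')


def remove_standalone_timestamps(text):
    return STANDALONE.sub('', text)


print(remove_standalone_timestamps('access 11:30 that?'))      # access that?
print(remove_standalone_timestamps('ratio 10:30:15 stays'))    # ratio 10:30:15 stays
```

Double spaces that were already in the transcript are left alone, and so are colons that are not part of an isolated dd:dd token.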

(Dec-05-2018, 02:49 PM)DeaD_EyE Wrote: Problem: 99:99 is also a valid match.

A better pattern:
timestamp = r'[012][0123456789]:[012345][0123456789] '
timestamp = r'[0-2]\d:[0-5]\d ' # short form
But this also allows values in timestamps like 25:00, which is an invalid time.

You could check each timestamp and remove it only if it is valid.
The question is, do you need that?

I tried both of your solutions and they both worked. I had a very quick check through the timestamps with some searching, and 78:01 seems to be the highest value. There are lots of values where the seconds are '00'. The format is not hh:mm:ss but mm:ss, so a value like 25:00 is okay.

I'm not sure whether there are values like 25:60, but I would need to check, as you stated.

(Dec-05-2018, 03:24 PM)buran Wrote: I guess timestamps are min:sec from start, not time like hh:mm, but it's up to OP to confirm that. In more broad aspect it raise the question of what are possible values, e.g. is it possible to have mmm:ss from start.

Yes, the format is min:sec from the start, and the highest value is 78:01, so there are only two digits for the minutes. I guess this is a case of modifying the code to suit the data.
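For the record, a rough sketch of what "modified to suit the data" could look like for a folder of files. It allows any two-digit minute (the [0-2]\d minute class would skip everything from 30:00 up, including 78:01) and seconds from 00 to 59. The function names, the '.clean.txt' suffix, and the UTF-8 encoding are assumptions for illustration, not something tested on the real transcripts.

```python
import re
from pathlib import Path

# mm:ss from the start of the recording: any two-digit minute, seconds 00-59,
# plus the single space that follows the timestamp.
TIMESTAMP = re.compile(r'\d{2}:[0-5]\d ')


def clean_text(text):
    return TIMESTAMP.sub('', text)


def clean_folder(folder):
    """Write a cleaned copy of every .txt file in folder; return the new paths."""
    written = []
    for path in sorted(Path(folder).glob('*.txt')):
        if path.name.endswith('.clean.txt'):
            continue  # skip output left over from a previous run
        cleaned = clean_text(path.read_text(encoding='utf-8'))
        # Hypothetical naming scheme: talk.txt -> talk.clean.txt
        target = path.with_name(path.stem + '.clean.txt')
        target.write_text(cleaned, encoding='utf-8')
        written.append(target)
    return written
```

Calling clean_folder with the directory that holds the transcripts leaves the originals untouched, so the cleaned copies can be compared against them before anything is overwritten.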

Thanks for those replies. :)
Reply

