Posts: 591
Threads: 26
Joined: Sep 2016
Jan-22-2018, 02:56 AM
(This post was last modified: Jan-22-2018, 02:17 PM by metulburr.)
This would normally go in homework discussions, but as the user has already solved their problem there is no reason to not post publicly.
Original poster was asking how to find the number of times one string occurred as a substring in another string.
https://python-forum.io/Thread-Substring-Counting
Extend this to instead give an algorithm that finds the number of times one string occurs as a subsequence in another string.
Whereas a substring must be contiguous, a subsequence need not be, though it must still be in the correct order.
For example if we count occurrences of the subsequence "an" in "banana" we get 3.
Comprised of the following: "b anana", "ban ana", "b ana na"
Algorithm should run in O(mn) time where m is the size of the string we are looking for and n is the size of the string we are looking in.
def subsequence_count(needle, haystack):
"""Your code here"""
pass
print(subsequence_count("an", "banana")) # Should print 3
print(subsequence_count("an", "trans-Panamanian banana")) # Should print 25
Posts: 4,449
Threads: 68
Joined: Jan 2018
Jan-22-2018, 08:03 AM
(This post was last modified: Jan-22-2018, 08:03 AM by Gribouillis.)
Here is a version
import unittest as ut
import re
def subsequence_count(needle, haystack):
if len(needle) == 1:
return haystack.count(needle)
elif len(needle) > 1:
a, b = needle[0], needle[1:]
result = 0
for mo in re.finditer(a, haystack):
result += subsequence_count(b, haystack[mo.end():])
return result
else:
raise ValueError(('Non empty string expected, got', needle))
class TestSubsequence(ut.TestCase):
def test_banana(self):
n = subsequence_count('an', 'banana')
self.assertEqual(n, 3)
def test_panama(self):
n = subsequence_count('an', 'trans-Panamanian banana')
self.assertEqual(n, 25)
if __name__ == '__main__':
ut.main()
Posts: 591
Threads: 26
Joined: Sep 2016
Jan-22-2018, 08:52 AM
(This post was last modified: Jan-22-2018, 08:52 AM by Mekire.)
That's pretty neat. Your recursive approach hurts my head.
I have trouble analyzing the runtime of recursive algorithms like this.
I want to say that it is O(n^m) but I'm not positive.
I base this on the idea that a run of size m,n takes:
T(m,n) = O(n) * T(m-1, n-1)
Not particularly confident in that though.
I did add a particularly pathological test however, and yours doesn't seem to manage it in a reasonable time
Output: ...
----------------------------------------------------------------------
Ran 3 tests in 346.731s
My approach went the dynamic programming route.
T[i, j] is the number of times needle[:i+1] occurs as a subsequence of haystack[:j+1] where T is our dynamic programming table.
This leads to the recursive problem definition:
T[i,j] = T[i,j-1] + T[i-1,j-1] if needle[i] == haystack[j] else T[i,j-1]
The algorithm ends up fairly straight forward from there just filling in the table row by row:
import numpy as np
import unittest as ut
def subsequence_count(needle, haystack):
n = len(haystack)
m = len(needle)
T = np.zeros((m+1, n+1), dtype=int)
T[0,:] = 1
for j,source in enumerate(haystack, start=1):
for i,search in enumerate(needle, start=1):
T[i,j] = T[i,j-1] + T[i-1,j-1] if source == search else T[i,j-1]
return T[-1,-1]
class TestSubsequence(ut.TestCase):
def test_banana(self):
n = subsequence_count('an', 'banana')
self.assertEqual(n, 3)
def test_panama(self):
n = subsequence_count('an', 'trans-Panamanian banana')
self.assertEqual(n, 25)
def test_assesses(self):
needle = "assessesassessesas"
haystack = "assessesassessesassessesassessesassessesassesses"
n = subsequence_count(needle, haystack)
self.assertEqual(n, 7963870)
if __name__ == '__main__':
ut.main() The pathological test is added at the bottom.
Mine results in:
Output: ...
----------------------------------------------------------------------
Ran 3 tests in 0.003s
OK
Posts: 4,449
Threads: 68
Joined: Jan 2018
Jan-22-2018, 10:18 AM
(This post was last modified: Jan-22-2018, 10:30 AM by Gribouillis.)
Of course, dynamic programming is the correct solution. I improved it to use only a one dimensional array. This is a nice generalization of Pascal's triangle. Actually, Pascal's triangle implements the case where the alphabet has only one letter!
import unittest as ut
def subsequence_count(needle, haystack):
T = [0 for x in needle]
pairs = [(i, needle[i]) for i in range(len(needle)-1, -1, -1)]
for source in haystack:
for i, search in pairs:
if source == search:
T[i] += T[i-1] if i > 0 else 1
return T[-1]
class TestSubsequence(ut.TestCase):
def test_banana(self):
n = subsequence_count('an', 'banana')
self.assertEqual(n, 3)
def test_panama(self):
n = subsequence_count('an', 'trans-Panamanian banana')
self.assertEqual(n, 25)
def test_nnn(self):
n = subsequence_count('nnn', 'nnnnn')
self.assertEqual(n, 10)
def test_assesses(self):
needle = "assessesassessesas"
haystack = "assessesassessesassessesassessesassessesassesses"
n = subsequence_count(needle, haystack)
self.assertEqual(n, 7963870)
if __name__ == '__main__':
ut.main() I get
. AutoExecution
....
----------------------------------------------------------------------
Ran 4 tests in 0.000s
OK
Posts: 591
Threads: 26
Joined: Sep 2016
Nice. I have seen problems like "infinite use coin-changeing," use a similar kind of idea to reduce to a 1d-array, but couldn't think of a way to do that here.
Posts: 2,955
Threads: 48
Joined: Sep 2016
def substr_count(substring, text):
return len(text.split(substring)) -1 This should work too.
sub_count('an', 'banana')
2
sub_count("an", "trans-Panamanian banana")
6
sub_count("an", "antrans-Panamanian banana")
7
sub_count("an", "antrans-Panamanian bananaan")
8
Posts: 4,449
Threads: 68
Joined: Jan 2018
Jan-22-2018, 11:41 AM
(This post was last modified: Jan-22-2018, 11:41 AM by Gribouillis.)
(Jan-22-2018, 11:38 AM)wavic Wrote: This should work too. hmmm it doesn't pass the tests...
Posts: 2,955
Threads: 48
Joined: Sep 2016
Posts: 591
Threads: 26
Joined: Sep 2016
Jan-22-2018, 12:41 PM
(This post was last modified: Jan-22-2018, 12:42 PM by Mekire.)
Did you read the top post?
In this version we are looking for subsequences, not substrings.
Quote:Extend this to instead give an algorithm that finds the number of times one string occurs as a subsequence in another string.
Whereas a substring must be contiguous, a subsequence need not be, though it must still be in the correct order.
For example if we count occurrences of the subsequence "an" in "banana" we get 3.
Comprised of the following: "banana", "banana", "banana"
I gave a couple sample input/outputs in the OP as well and Grib put them in unittests for convenience.
This is the template with tests:
import unittest as ut
def subsequence_count(needle, haystack):
"""Your code here"""
pass
class TestSubsequence(ut.TestCase):
def test_banana(self):
n = subsequence_count('an', 'banana')
self.assertEqual(n, 3)
def test_panama(self):
n = subsequence_count('an', 'trans-Panamanian banana')
self.assertEqual(n, 25)
def test_nnn(self):
n = subsequence_count('nnn', 'nnnnn')
self.assertEqual(n, 10)
def test_assesses(self):
needle = "assessesassessesas"
haystack = "assessesassessesassessesassessesassessesassesses"
n = subsequence_count(needle, haystack)
self.assertEqual(n, 7963870)
if __name__ == '__main__':
ut.main()
Posts: 2,955
Threads: 48
Joined: Sep 2016
|