Problem Statement in English
You’re given $2$ strings - text1, and text2. Your task is to return the longest common subsequence in those $2$ strings.
How does a subsequence differ form a substring? You can skip characters in between. So for example valid subsequences in the string “beagle” are “bea”, “beg”, “eagle”, “ale” etc. So long as the order of occurence in the subsequence matches the original string, you’re good.
Approach
The solution to this actually isn’t intuitive, and I feel that unless you’ve seen this before, it’s not very likely that you’ll figure it out.
You need to arrange the $2$ strings in a grid. The first forms the topmost row, and the second, the leftmost column.
Let’s work our way through an example.
Here’s how the grid first starts out. I’m directly adding $0$ as the value in the cells so that we can update the values as we go, rather than having to add them to the grid.
I just think it looks and feels nicer this way, but hey, you do you.
| a | b | c | d | e | |
| a | $0$ | $0$ | $0$ | $0$ | $0$ |
| c | $0$ | $0$ | $0$ | $0$ | $0$ |
| e | $0$ | $0$ | $0$ | $0$ | $0$ |
In our first pass we go over the first row and see if the row character (“a” in this case) matches any column character. If it does, we increment the local longest common subsequence count by $1$.
In case it doesn’t, we use the $\text{max}$ value of the cell above, or on the left - essentially using the previously computed best length up until that point.
| a | b | c | d | e | |
| a | $1$ | $1$ | $1$ | $1$ | $1$ |
| c | $0$ | $0$ | $0$ | $0$ | $0$ |
| e | $0$ | $0$ | $0$ | $0$ | $0$ |
- When we see that the row and column char are the same we add $1$ to the diagonally left element on the top, and use that as our local LCS answer.
| a | b | c | d | e | |
| a | $1$ | $1$ | $1$ | $1$ | $1$ |
| c | $0$ | $0$ | $2$ | $2$ | $2$ |
| e | $0$ | $0$ | $0$ | $0$ | $0$ |
- And here’s the last pass.
| a | b | c | d | e | |
| a | $1$ | $1$ | $1$ | $1$ | $1$ |
| c | $0$ | $0$ | $2$ | $2$ | $2$ |
| e | $0$ | $0$ | $2$ | $2$ | $3$ |
- The final answer will always be the last cell in the last row, $3$, in this case, since “ace” is the longest common subsequence.
Solution in Python
The simple approach that just gets the job done.
class Solution:
def longestCommonSubsequence(self, text1: str, text2: str) -> int:
grid = []
for i in range(len(text1)):
# pre-populate grid row as required
grid.append([0 for _ in range(len(text2))])
for j in range(len(text2)):
# create some local vars
localLCS = 0
cellAbove = 0
cellOnLeft = 0
if text1[i] != text2[j]:
# then the lcs so far,
# is the max of cell on the right and on the top
if i >= 1:
cellAbove = grid[i-1][j]
if j>= 1:
cellOnLeft = grid[i][j-1]
localLCS = max(cellAbove, cellOnLeft)
else:
# then the lcs so far,
# is the cell on the top left diagonally + 1
if i >= 1 and j>=1:
localLCS = grid[i-1][j-1] + 1
else:
localLCS = 1
# save the localLCS to this cell
grid[i][j] = localLCS
# return the answer
return grid[-1][-1]
The space optimised approach. We ensure that we only store 2 rows at most. This row, and the previous one.
from collections import deque
class Solution:
def longestCommonSubsequence(self, text1: str, text2: str) -> int:
# we're going to guarantee that we have 2 rows atmost in the queue
grid = deque()
for i in range(len(text1)):
if len(grid) == 2:
grid.popleft()
# pre-populate grid row as required
grid.append([0 for _ in range(len(text2))])
for j in range(len(text2)):
# create some local vars
localLCS = 0
cellAbove = 0
cellOnLeft = 0
if text1[i] != text2[j]:
# then the lcs so far,
# is the max of cell on the left and on the top
if len(grid) > 1:
cellAbove = grid[-2][j]
if j >= 1:
cellOnLeft = grid[-1][j - 1]
localLCS = max(cellAbove, cellOnLeft)
else:
# then the lcs so far,
# is the cell on the top left diagonally + 1
if len(grid) > 1 and j >= 1:
localLCS = grid[-2][j - 1] + 1
else:
localLCS = 1
# save the `localLCS` to this cell
grid[-1][j] = localLCS
# return the answer
return grid[-1][-1]
Complexity
Time: $O(m\times{n})$
Here, $m$ is the number of characters that forms the first column of the grid, and $n$ is the number of characters that forms the first row of the grid.This is because we iterate over atmost $m \times n$ cells.
- Space: $O(n)$
Here, $n$ is the number of characters in the string that forms the row of the grid. (assuming we're using the space optimised version of the solution), since we only store $2n$ cells.
Mistakes I Made
I bungled up a whole bunch with the space optimised version’s implementation. Sigh.
And we are done.