Problem

Given a string s and a dictionary of strings wordDict, return true if s can be segmented into a space-separated sequence of one or more dictionary words.

Note that the same word in the dictionary may be reused multiple times in the segmentation.

https://leetcode.com/problems/word-break/

Example 1:

Input: s = "leetcode", wordDict = ["leet","code"]
Output: true
Explanation: Return true because “leetcode” can be segmented as “leet code”.

Example 2:

Input: s = "applepenapple", wordDict = ["apple","pen"]
Output: true
Explanation: Return true because “applepenapple” can be segmented as “apple pen apple”.
Note that you are allowed to reuse a dictionary word.

Example 3:

Input: s = "catsandog", wordDict = ["cats","dog","sand","and","cat"]
Output: false

Constraints:

  • 1 <= s.length <= 300
  • 1 <= wordDict.length <= 1000
  • 1 <= wordDict[i].length <= 20
  • s and wordDict[i] consist of only lowercase English letters.
  • All the strings of wordDict are unique.

Test Cases

1
2
class Solution:
def wordBreak(self, s: str, wordDict: List[str]) -> bool:
solution_test.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import pytest

from solution import Solution
from solution2 import Solution as Solution2


@pytest.mark.parametrize('s, wordDict, expected', [
("leetcode", ["leet","code"], True),
("applepenapple", ["apple","pen"], True),
("catsandog", ["cats","dog","sand","and","cat"], False),
])
class Test:
def test_solution(self, s, wordDict, expected):
sol = Solution()
assert sol.wordBreak(s, wordDict) == expected

def test_solution2(self, s, wordDict, expected):
sol = Solution2()
assert sol.wordBreak(s, wordDict) == expected

Thoughts

设字符串 s 的长度为 n,si,js_{i,j} 表示其中的一个子串(1ijn1\le i\le j\le n)。设 wordDict 构成的集合为 D

cani,jcan_{i,j} 表示子串 si,js_{i,j} 对应的结果,值为 truefalse

那么有:

cani,j=si,jDp:(ip<j,cani,pcanp+1,j)can_{i,j}=s_{i,j}\in D\lor \exists p:(i\le p<j,can_{i,p}\land can_{p+1,j})

正向计算会有大量递归和重复计算,所以用动态规划算法,用二维数组保存 cani,jcan_{i,j} 的值,并从最小间隔的 i、j 开始。

判定 si,jDs_{i,j}\in D 时,先根据 si,js_{i,j} 的长度剪枝,即比 wordDict 中所有单词都短或都长的,可以直接跳过。

可以把 wordDict 中所有单词放在哈希表中,直接查表。设单词的平均长度为 w,那么一次查询的时间复杂度为 O(w)

或者可以自行构建 trie 树,时空复杂度与用哈希表差不多(试了一版,并不比直接用 python 的 set 快)。

整体空间复杂度为 O(n^2),时间复杂度为 O(w * n^3),当 w << n 时,约等于 O(n^3)

考虑到为 true 的 cani,jcan_{i,j} 相对会很少,可以用哈希表/哈希集合,只记录值为 true 的 (i, j) 对,空间复杂度会低很多。

Code

solution.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
from typing import List


class Solution:
def wordBreak(self, s: str, wordDict: List[str]) -> bool:
min_w = min(len(word) for word in wordDict)
max_w = max(len(word) for word in wordDict)
words = set(wordDict)

n = len(s)
# buffer = [[False] * n for _ in range(n)]
buffer = set()
for w in range(min_w, n + 1):
for i in range(n - w + 1):
j = i + w - 1
can = w <= max_w and s[i:j+1] in words
# can = can or any(buffer[i][p] and buffer[p+1][j] for p in range(i, j))
# buffer[i][j] = can
can = can or any((i, p) in buffer and (p + 1, j) in buffer for p in range(i, j))
if can:
buffer.add((i, j))

# return buffer[0][n - 1]
return (0, n - 1) in buffer

Improve

既然值为 true 的 cani,jcan_{i,j} 相对会很少,也就说明有大量结果为 false 的计算是无用的,需要根据题目的场景做特定的裁剪,明显不可能的 i、j 对不应该计算。

考虑把 j 固定在字符串的最右端,即 j = n。另外设 wordDict (集合 D)中单词长度的最小最大值分别是 wminw_{min}wmaxw_{max}。那么:

cani,n=w:{wminwwmax,p=defi+w1,ipn,si,pD,p=ncanp+1,n=truecan_{i,n}=\exists w:\begin{cases} w_{min}\le w\le w_{max}, \\ p\overset{\underset{\mathrm{def}}{}}{=}i+w-1, \\ i\le p\le n, \\ s_{i,p}\in D, \\ p=n\lor can_{p+1,n}=true \end{cases}

也就是对于子串 si,ns_{i,n},从左端开始找出所有能匹配到的单词。假设能找到一个单词 si,ps_{i,p},对应的 canp+1,ncan_{p+1,n} 亦为 true,则 cani,ncan_{i,n} 为 true。

上边没发挥有效作用的 trie 树,在这里可以以更快的速度从字符串 si,ns_{i,n} 的起始位置找到所有单词。比如单词表为 ["dog", "dogs"],利用 trie 树可以一次扫描从 "dogsandcat" 中找出 "dog""dogs"

从 n 到 1 逆向遍历 i,循环一次即可。整体时间复杂度为 O(w * n),其中 w 是 wordDict 中单词长度均值。空间复杂度为 O(n) 加上单词哈希表或 trie 树的大小。

优化之后的代码:

solution2.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
from typing import Generator, List


class TrieNode:
def __init__(self):
self.is_word = False
self.children: dict[str, 'TrieNode'] = {}


class TrieTree:
def __init__(self, words: list[str]):
self._root = TrieNode()
for word in words:
self.add(word)

def add(self, word: str):
node = self._root
for c in word:
if (child := node.children.get(c)) is None:
child = node.children[c] = TrieNode()

node = child

node.is_word = True

def lookup(self, text: str, start: int) -> Generator[int, None, None]:
node = self._root
if node.is_word:
yield start

n = len(text)
while start < n:
node = node.children.get(text[start])
if node is None:
break

start += 1
if node.is_word:
yield start


class Solution():
def wordBreak(self, s: str, wordDict: List[str]) -> bool:
min_w = min(len(word) for word in wordDict)
tree = TrieTree(wordDict)

n = len(s)
buffer = [False] * n
for i in range(n - min_w, -1, -1):
for p in tree.lookup(s, i):
if p == n or buffer[p]:
buffer[i] = True
break

return buffer[0]

Inner method test cases:

solution2_inner_test.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import pytest

from solution2 import TrieTree


@pytest.mark.parametrize('words, text, start, expected', [
(['code', 'leet'], 'leetcode', 0, ['leet']),
(['code', 'leet'], 'lee', 0, []),
(['code', 'leet'], 'leet', 0, ['leet']),
(['code', 'leet'], 'leetcode', 4, ['code']),
(['dog', 'dogs'], 'dogsandcat', 0, ['dog', 'dogs']),
(['', 'dog', 'dogs', 'cat'], 'dogsandcat', 0, ['', 'dog', 'dogs']),
])
class Test:
def test_trie_lookup(self, words, text, start, expected):
tree = TrieTree(words)
positions = list(tree.lookup(text, start))
result = [text[start:p] for p in positions]
assert result == expected