LeetCode 187 – Repeated DNA Sequences (Java Solution with Sliding Window and HashSet)
LeetCode 187 – Repeated DNA Sequences (Java Solution with Sliding Window and HashSet)

LeetCode 187 – Repeated DNA Sequences (Java Solution with Sliding Window and HashSet)

Step-by-Step Explanation with Optimized Approach and Time Complexity Analysis

10 views
0
0

Introduction

In this article, we will solve LeetCode 187: Repeated DNA Sequences using Java. This is a popular string problem that tests your understanding of the sliding window technique and efficient use of hash-based data structures.

DNA sequences are composed of four characters:

  1. A (Adenine)
  2. C (Cytosine)
  3. G (Guanine)
  4. T (Thymine)

The goal is to identify all 10-letter-long substrings that appear more than once in a given DNA string.

You can try solving the problem directly on LeetCode here:

https://leetcode.com/problems/repeated-dna-sequences/


Problem Statement

Given a string s that represents a DNA sequence, return all the 10-letter-long substrings that occur more than once.

Example 1

Input:

s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output:

["AAAAACCCCC", "CCCCCAAAAA"]

Example 2

Input:

s = "AAAAAAAAAAAAA"

Output:

["AAAAAAAAAA"]


Key Observations

  1. We only need substrings of fixed length 10.
  2. The maximum length of the string can be up to 10^5.
  3. A brute-force solution checking all substrings multiple times would be inefficient.
  4. This problem can be solved efficiently using a sliding window and hash-based data structures.


Approach 1: Sliding Window with HashSet (Given Solution)

Idea

  1. Use two pointers (i and j) to maintain a sliding window.
  2. Build a substring of size 10 dynamically.
  3. Store previously seen substrings in a HashSet.
  4. If a substring is already present in the set:
  5. Check if it is already in the result list.
  6. If not, add it to the result list.
  7. Slide the window forward and continue.

Java Code (Your Implementation)

class Solution {
public List<String> findRepeatedDnaSequences(String s) {
HashSet<String> ms = new HashSet<>();
List<String> lis = new ArrayList<>();
int i = 0;
int j = 0;
String tes = "";
while (j < s.length()) {
tes += s.charAt(j);
if (j - i + 1 < 10) {
j++;
} else {
if (j - i + 1 == 10) {
if (ms.contains(tes)) {
boolean fl = false;
for (String a : lis) {
if (a.equals(tes)) {
fl = true;
}
}
if (!fl) {
lis.add(tes);
}
} else {
ms.add(tes);
}
tes = tes.substring(1);
i++;
j++;
}
}
}
return lis;
}
}

Explanation

  1. The variable tes maintains the current substring.
  2. ms stores all previously seen substrings of length 10.
  3. If a substring already exists in ms, we manually check whether it has already been added to the result list.
  4. This avoids duplicate entries in the final output.

Time Complexity

  1. Sliding through the string: O(n)
  2. Checking duplicates in the result list: O(n) in the worst case
  3. Overall worst-case complexity: O(n²)

Space Complexity

  1. HashSet storage: O(n)


Limitation of Approach 1

The manual duplicate check using a loop inside the result list introduces unnecessary overhead. This makes the solution less efficient.

We can improve this by using another HashSet to automatically handle duplicates.


Approach 2: Optimized Solution Using Two HashSets

Idea

  1. Use one HashSet called seen to track all substrings of length 10.
  2. Use another HashSet called repeated to store substrings that appear more than once.
  3. Iterate from index 0 to s.length() - 10.
  4. Extract substring of length 10.
  5. If adding to seen fails, it means it has appeared before.
  6. Add it directly to repeated.

This removes the need for a nested loop.

Optimized Java Code

class Solution {
public List<String> findRepeatedDnaSequences(String s) {
Set<String> seen = new HashSet<>();
Set<String> repeated = new HashSet<>();

for (int i = 0; i <= s.length() - 10; i++) {
String substring = s.substring(i, i + 10);
if (!seen.add(substring)) {
repeated.add(substring);
}
}

return new ArrayList<>(repeated);
}
}

Why This Approach is Better

  1. No manual duplicate checking.
  2. Cleaner and more readable code.
  3. Uses HashSet properties efficiently.
  4. Each substring is processed only once.

Time Complexity (Optimized)

  1. Single traversal of the string: O(n)
  2. Substring extraction of fixed length 10: O(1)
  3. Overall time complexity: O(n)

Space Complexity

  1. Two HashSets storing substrings: O(n)


Conclusion

LeetCode 187 is a classic example of combining the sliding window technique with hash-based data structures.

  1. The first approach works but has unnecessary overhead due to manual duplicate checks.
  2. The second approach is more optimal, cleaner, and recommended for interviews.
  3. Always leverage the properties of HashSet to avoid redundant checks.

This problem highlights the importance of choosing the right data structure to optimize performance.

Ai Assistant Kas