
LeetCode 187 – Repeated DNA Sequences (Java Solution with Sliding Window and HashSet)
IntroductionIn this article, we will solve LeetCode 187: Repeated DNA Sequences using Java. This is a popular string problem that tests your understanding of the sliding window technique and efficient use of hash-based data structures.DNA sequences are composed of four characters:A (Adenine)C (Cytosine)G (Guanine)T (Thymine)The goal is to identify all 10-letter-long substrings that appear more than once in a given DNA string.You can try solving the problem directly on LeetCode here: https://leetcode.com/problems/repeated-dna-sequences/Problem StatementGiven a string s that represents a DNA sequence, return all the 10-letter-long substrings that occur more than once.Example 1Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"Output: ["AAAAACCCCC", "CCCCCAAAAA"]Example 2Input: s = "AAAAAAAAAAAAA"Output: ["AAAAAAAAAA"]Key ObservationsWe only need substrings of fixed length 10.The maximum length of the string can be up to 10^5.A brute-force solution checking all substrings multiple times would be inefficient.This problem can be solved efficiently using a sliding window and hash-based data structures.Approach 1: Sliding Window with HashSet (Given Solution)IdeaUse two pointers (i and j) to maintain a sliding window.Build a substring of size 10 dynamically.Store previously seen substrings in a HashSet.If a substring is already present in the set:Check if it is already in the result list.If not, add it to the result list.Slide the window forward and continue.Java Code (Your Implementation)class Solution { public List<String> findRepeatedDnaSequences(String s) { HashSet<String> ms = new HashSet<>(); List<String> lis = new ArrayList<>(); int i = 0; int j = 0; String tes = ""; while (j < s.length()) { tes += s.charAt(j); if (j - i + 1 < 10) { j++; } else { if (j - i + 1 == 10) { if (ms.contains(tes)) { boolean fl = false; for (String a : lis) { if (a.equals(tes)) { fl = true; } } if (!fl) { lis.add(tes); } } else { ms.add(tes); } tes = tes.substring(1); i++; j++; } } } return lis; }}ExplanationThe variable tes maintains the current substring.ms stores all previously seen substrings of length 10.If a substring already exists in ms, we manually check whether it has already been added to the result list.This avoids duplicate entries in the final output.Time ComplexitySliding through the string: O(n)Checking duplicates in the result list: O(n) in the worst caseOverall worst-case complexity: O(n²)Space ComplexityHashSet storage: O(n)Limitation of Approach 1The manual duplicate check using a loop inside the result list introduces unnecessary overhead. This makes the solution less efficient.We can improve this by using another HashSet to automatically handle duplicates.Approach 2: Optimized Solution Using Two HashSetsIdeaUse one HashSet called seen to track all substrings of length 10.Use another HashSet called repeated to store substrings that appear more than once.Iterate from index 0 to s.length() - 10.Extract substring of length 10.If adding to seen fails, it means it has appeared before.Add it directly to repeated.This removes the need for a nested loop.Optimized Java Codeclass Solution { public List<String> findRepeatedDnaSequences(String s) { Set<String> seen = new HashSet<>(); Set<String> repeated = new HashSet<>(); for (int i = 0; i <= s.length() - 10; i++) { String substring = s.substring(i, i + 10); if (!seen.add(substring)) { repeated.add(substring); } } return new ArrayList<>(repeated); }}Why This Approach is BetterNo manual duplicate checking.Cleaner and more readable code.Uses HashSet properties efficiently.Each substring is processed only once.Time Complexity (Optimized)Single traversal of the string: O(n)Substring extraction of fixed length 10: O(1)Overall time complexity: O(n)Space ComplexityTwo HashSets storing substrings: O(n)ConclusionLeetCode 187 is a classic example of combining the sliding window technique with hash-based data structures.The first approach works but has unnecessary overhead due to manual duplicate checks.The second approach is more optimal, cleaner, and recommended for interviews.Always leverage the properties of HashSet to avoid redundant checks.This problem highlights the importance of choosing the right data structure to optimize performance.






