FeedbackArticles

STRINGS

In computer science, a string is a sequence of characters, such as letters, numbers, and symbols. Strings are a fundamental data type in many programming languages, and are used to represent text, data, and other types of information.

Strings can be represented in many ways, such as as an array of characters or a linked list of characters. Many programming languages provide built-in string types and libraries that allow you to manipulate and process strings.

Some common operations on strings include concatenation (joining two or more strings together), substring (extracting a portion of a string), searching for a substring within a string, and replacing a substring with another string.

Strings can be compared using lexicographic order, which compares the first characters of the strings, and then compares the second characters if the first characters are equal, and so on. Many programming languages provide built-in functions for comparing strings, such as the equals method in Java.

Strings can also be processed using regular expressions, which are patterns that match specific strings. Regular expressions can be used to search for patterns in a string, extract information from a string, or validate a string against a certain pattern.

String algorithms and data structures are important topics in computer science, and include topics such as string matching, string compression, and suffix trees. These topics are used in a wide variety of applications, such as text processing, bioinformatics, and natural language processing.

public class MyString {

    private final char[] value;

    private final int length;

    public MyString(char[] value) {

        this.value = value;

        this.length = value.length;

    }

    public MyString(String s) {

        this.value = s.toCharArray();

        this.length = s.length();

    }

    public char charAt(int index) {

        if (index < 0 || index >= length) {

            throw new IndexOutOfBoundsException("Index " + index + " out of bounds for length " + length);

        }

        return value[index];

    }

    public int length() {

        return length;

    }

    public MyString substring(int beginIndex, int endIndex) {

        if (beginIndex < 0 || endIndex > length || beginIndex > endIndex) {

            throw new IndexOutOfBoundsException("beginIndex: " + beginIndex + ", endIndex: " + endIndex + ", length: " + length);

        }

        char[] result = new char[endIndex - beginIndex];

        System.arraycopy(value, beginIndex, result, 0, result.length);

        return new MyString(result);

    }

    public MyString concat(MyString s) {

        char[] result = new char[length + s.length()];

        System.arraycopy(value, 0, result, 0, length);

        System.arraycopy(s.value, 0, result, length, s.length);

        return new MyString(result);

    }

    public boolean equals(MyString s) {

        if (length != s.length) {

            return false;

        }

        for (int i = 0; i < length; i++) {

            if (value[i] != s.value[i]) {

                return false;

            }

        }

        return true;

    }

    public int indexOf(MyString s) {

        if (s.length > length) {

            return -1;

        }

        for (int i = 0; i <= length - s.length; i++) {

            boolean found = true;

            for (int j = 0; j < s.length; j++) {

                if (value[i + j] != s.value[j]) {

                    found = false;

                    break;

                }

            }

            if (found) {

                return i;

            }

        }

        return -1;

    }

    public String toString() {

        return new String(value);

    }

}

This implementation defines a MyString class that represents a string of characters. The class has two constructors, one that takes an array of characters and another that takes a String object. The class has several methods, including:

  • charAt: returns the character at the specified index
  • length: returns the length of the string
  • substring: returns a substring of the string, starting at the specified begin index and ending at the specified end index
  • concat: concatenates this string with another string
  • equals: returns true if this string is equal to another string
  • indexOf: returns the index of the first occurrence of the specified substring in this string

This implementation is not complete, and does not include all of the methods and functionality of the built-in String class in Java. However, it should provide a basic understanding of how a string class can be implemented in Java, and how the fundamental methods of a string class can be implemented.

NAIVE STRING MATCHING

Naive string matching is a simple algorithm used to find occurrences of a pattern string in a text string. The algorithm compares each character in the pattern string to each character in the text string, starting at each position in the text string, until a match is found or all positions have been checked.

The naive string matching algorithm has a time complexity of O(mn), where m is the length of the pattern string and n is the length of the text string. The algorithm is simple to implement and works well for small pattern strings and text strings.

Here is an example implementation of the naive string matching algorithm in Java:

public static List<Integer> naiveStringMatch(String pattern, String text) {

    List<Integer> matches = new ArrayList<Integer>();

    int m = pattern.length();

    int n = text.length();

    for (int i = 0; i <= n - m; i++) {

        int j;

        for (j = 0; j < m; j++) {

            if (text.charAt(i + j) != pattern.charAt(j)) {

                break;

            }

        }

        if (j == m) {

            matches.add(i);

        }

    }

    return matches;

}

This implementation takes two string arguments, a pattern string and a text string, and returns a list of the starting positions of all occurrences of the pattern string in the text string. The algorithm works by iterating over each position in the text string, and comparing the characters in the pattern string to the characters in the text string, starting at each position.

The algorithm uses two nested loops, one for iterating over the positions in the text string, and one for iterating over the characters in the pattern string. If a mismatch is found between a character in the pattern string and the corresponding character in the text string, the algorithm breaks out of the inner loop and continues checking the next position in the text string.

If the inner loop completes without finding a mismatch, the algorithm has found a match, and adds the starting position of the match to the list of matches.

The naive string matching algorithm is simple to implement, but can be slow for large strings, since it compares each character in the pattern string to each character in the text string. More efficient string matching algorithms, such as the Knuth-Morris-Pratt algorithm and the Boyer-Moore algorithm, use additional techniques to avoid unnecessary comparisons and reduce the time complexity of the algorithm.

RABIN KARP ALGORITHM

The Rabin-Karp algorithm is a string matching algorithm that is used to find occurrences of a pattern string in a text string. The algorithm works by using a hash function to compare the pattern string to the text string, and is more efficient than the naive string matching algorithm for large pattern strings and text strings.

The Rabin-Karp algorithm uses a rolling hash function to compute a hash value for each substring of the text string that is the same length as the pattern string. The hash function is designed such that it can be computed in constant time for each substring, allowing the algorithm to avoid the O(mn) time complexity of the naive string matching algorithm.

The Rabin-Karp algorithm works as follows:

  1. Compute the hash value of the pattern string.
  2. Compute the hash value of the first substring of the text string that is the same length as the pattern string.
  3. Compare the hash values. If they match, compare the actual substrings character by character. If they do not match, compute the hash value of the next substring by removing the first character of the previous substring and adding the next character of the text string. Repeat step 3 until a match is found or all substrings have been checked.

The hash function used in the Rabin-Karp algorithm is designed to be fast and avoid collisions between different substrings. A common approach is to use a polynomial hash function, which represents a string as a polynomial of its characters and evaluates the polynomial at a fixed value, such as a prime number. This allows the hash value to be computed in constant time using a rolling hash algorithm, which updates the hash value by subtracting the contribution of the first character of the previous substring and adding the contribution of the next character.

The time complexity of the Rabin-Karp algorithm is O(m + n), where m is the length of the pattern string and n is the length of the text string. This is the same as the time complexity of the best-case scenario for the naive string matching algorithm, but the Rabin-Karp algorithm can be much faster for large strings, since it avoids unnecessary comparisons by using the hash function.

Here is an example implementation of the Rabin-Karp algorithm in Java:

public static List<Integer> rabinKarp(String pattern, String text) {

    List<Integer> matches = new ArrayList<Integer>();

    int m = pattern.length();

    int n = text.length();

    int prime = 101; // prime number used in hash function

    int patternHash = 0;

    int textHash = 0;

    int h = 1;

    // compute hash value of pattern and first substring of text

    for (int i = 0; i < m - 1; i++) {

        h = (h * 256) % prime;

    }

    for (int i = 0; i < m; i++) {

        patternHash = (patternHash * 256 + pattern.charAt(i)) % prime;

        textHash = (textHash * 256 + text.charAt(i)) % prime;

    }

    // compare hash values and actual substrings

    for (int i = 0; i <= n - m; i++) {

        if (patternHash == textHash) {

            int j;

            for (j = 0; j < m; j++) {

                if (text.charAt(i + j) != pattern.charAt(j)) {

                    break;

                }

            }

            if (j == m) {

                matches.add(i);

            }

        }

        if (i < n - m) {

            textHash = ((textHash - text.charAt(i) * h) * 256 + text.charAt(i + m)) % prime;

            if (textHash < 0) {

                textHash += prime;

            }

        }

    }

    return matches;

}

This implementation takes two string arguments, a pattern string and a text string, and returns a list of the starting positions of all occurrences of the pattern string in the text string. The algorithm works by computing the hash value of the pattern string and the first substring of the text string that is the same length as the pattern string. It then compares the hash values, and if they match, compares the actual substrings character by character.

The implementation uses a polynomial hash function to compute the hash values, with a prime number used as the modulus to reduce the risk of hash collisions. The implementation also uses a rolling hash algorithm to compute the hash values for each substring, allowing the algorithm to avoid the O(mn) time complexity of the naive string matching algorithm.

The Rabin-Karp algorithm is a powerful string matching algorithm that can be used in a variety of applications, including text search and data compression. However, it has some limitations, including the risk of hash collisions and the sensitivity to the choice of hash function parameters.

KNUTH MORRIS PRATT ALGORITHM

The Knuth-Morris-Pratt (KMP) algorithm is a string matching algorithm that is used to find occurrences of a pattern string in a text string. The algorithm works by using a prefix function to determine the maximum length of a proper prefix of the pattern string that is also a suffix of the pattern string. This information is used to skip over unnecessary comparisons between the pattern string and the text string during the matching process.

The KMP algorithm works as follows:

  1. Preprocess the pattern string to compute the prefix function, which is a table that stores the length of the longest proper prefix of the pattern string that is also a suffix of the pattern string.
  2. Initialize two pointers, i and j, to 0, representing the current position in the text string and the pattern string, respectively.
  3. While i is less than the length of the text string, compare the character at position i in the text string to the character at position j in the pattern string. If they match, increment both i and j. If j is equal to the length of the pattern string, a match has been found and the index of the match can be added to the list of matches.
  4. If they do not match, use the prefix function to determine the next position to compare in the pattern string. Set j to the value of the prefix function at position j, and continue with step 3.

The prefix function used in the KMP algorithm is computed using dynamic programming, and is based on the observation that if a substring of the pattern string is a proper prefix of the pattern string and also a suffix of the pattern string, then the next character after the substring must be different from the last character of the substring. This allows the prefix function to be computed in O(m) time, where m is the length of the pattern string.

The time complexity of the KMP algorithm is O(m + n), where m is the length of the pattern string and n is the length of the text string. This is the same as the time complexity of the Rabin-Karp algorithm, but the KMP algorithm can be more efficient for certain types of patterns and texts, since it avoids unnecessary comparisons by using the prefix function.

Here is an example implementation of the KMP algorithm in Java:

public static List<Integer> kmp(String pattern, String text) {

    List<Integer> matches = new ArrayList<Integer>();

    int m = pattern.length();

    int n = text.length();

    int[] prefix = computePrefix(pattern);

    // initialize pointers

    int i = 0;

    int j = 0;

    // compare characters and update pointers

    while (i < n) {

        if (pattern.charAt(j) == text.charAt(i)) {

            i++;

            j++;

        }

        if (j == m) {

            matches.add(i - j);

            j = prefix[j - 1];

        } else if (i < n && pattern.charAt(j) != text.charAt(i)) {

            if (j != 0) {

                j = prefix[j - 1];

            } else {

                i++;

            }

        }

    }

    return matches;

}

private static int[] computePrefix(String pattern) {

    int m = pattern.length();

    int[] prefix = new int[m];

    prefix[0] = 0;

    int i = 0;

    int j = 1;

    while (j < m) {

        if (pattern.charAt(i) == pattern.charAt(j)) {

            i++;

            prefix[j] = i;

            j++;

        } else {

            if (i != 0) {

                i = prefix[i - 1];

            } else {

                prefix[j] = i;

                j++;

            }

        }

    }

    return prefix;

}

This implementation takes two string arguments, a pattern string and a text string, and returns a list of the starting positions of all occurrences of the pattern string in the text string.

The algorithm works by first computing the prefix function for the pattern string using the computePrefix method. The kmp method then initializes two pointers, i and j, to 0, representing the current position in the text string and the pattern string, respectively.

The algorithm then compares the character at position i in the text string to the character at position j in the pattern string. If they match, both pointers are incremented. If j is equal to the length of the pattern string, a match has been found and the index of the match can be added to the list of matches.

If the characters do not match, the prefix function is used to determine the next position to compare in the pattern string. The value of j is set to the value of the prefix function at position j, and the algorithm continues with the next character in the text string.

The time complexity of the KMP algorithm is O(m + n), where m is the length of the pattern string and n is the length of the text string. This is the same as the time complexity of the Rabin-Karp algorithm, but the KMP algorithm can be more efficient for certain types of patterns and texts, since it avoids unnecessary comparisons by using the prefix function.

BOYER MOORE ALGORITHM

The Boyer-Moore algorithm is a string searching algorithm that is used to find the occurrence of a pattern string in a text string. It is based on two main ideas:

The bad character rule: In the search process, if a mismatch occurs at position j in the pattern string, the algorithm skips over all occurrences of the character that caused the mismatch in the text string to the right of position j.

The good suffix rule: If a suffix of the pattern string matches a substring of the text string, the algorithm can skip over the matching suffix and align the next character of the pattern string with the next character of the text string.

The Boyer-Moore algorithm is particularly efficient when the pattern string is longer than the text string, or when the pattern string contains many repeated characters. The algorithm can be implemented using two arrays, the bad character table and the good suffix table.

The bad character table is a table that stores the rightmost occurrence of each character in the pattern string. If a mismatch occurs at position j in the pattern string and the mismatched character is c, the algorithm skips over all occurrences of c in the text string to the right of position j.

The good suffix table is a table that stores the maximum shift distance for each suffix of the pattern string that matches a prefix of the pattern string. If a suffix of the pattern string matches a substring of the text string, the algorithm can skip over the matching suffix and align the next character of the pattern string with the next character of the text string.

The time complexity of the Boyer-Moore algorithm is O(m + n), where m is the length of the pattern string and n is the length of the text string. In practice, the algorithm can be much faster than other string matching algorithms in many cases.

Here is an example implementation of the Boyer-Moore algorithm in Java:

public static List<Integer> boyerMoore(String pattern, String text) {

    List<Integer> matches = new ArrayList<Integer>();

    int m = pattern.length();

    int n = text.length();

    int[] badChar = new int[256];

    int[] goodSuffix = new int[m];

    // initialize bad character table

    for (int i = 0; i < 256; i++) {

        badChar[i] = -1;

    }

    for (int i = 0; i < m; i++) {

        badChar[pattern.charAt(i)] = i;

    }

    // initialize good suffix table

    int[] suffixes = computeSuffixes(pattern);

    for (int i = 0; i < m; i++) {

        goodSuffix[i] = m;

    }

    for (int i = m - 1; i >= 0; i--) {

        if (suffixes[i] == i + 1) {

            for (int j = 0; j < m - i - 1; j++) {

                if (goodSuffix[j] == m) {

                    goodSuffix[j] = m - i - 1;

                }

            }

        }

    }

    for (int i = 0; i <= m - 2; i++) {

        goodSuffix[m - 1 - suffixes[i]] = m - i - 1;

    }

    // initialize pointers

    int i = m - 1;

    int j = m - 1;

    // compare characters and update pointers

    while (i < n) {

        if (pattern.charAt(j) == text.charAt(i)) {

            if (j == 0) {

                matches.add(i);

                i += m;

                j = m - 1;

            } else {

                i--;

                j--;

            }

        } else {

            int bcShift = badChar[text.charAt(i)];

            int gsShift = goodSuffix[j];

            int shift = Math.max(gsShift, j - bcShift);

            i += shift;

            j = m - 1;

        }

    }

    return matches;

}

private static int[] computeSuffixes(String pattern) {

    int m = pattern.length();

    int[] suffixes = new int[m];

    int f = 0;

    int g = m - 1;

    suffixes[m - 1] = m;

    for (int i = m - 2; i >= 0; i--) {

        if (i > g && suffixes[i + m - 1 - f] < i - g) {

            suffixes[i] = suffixes[i + m - 1 - f];

        } else {

            if (i < g) {

                g = i;

            }

            f = i;

            while (g >= 0 && pattern.charAt(g) == pattern.charAt(g + m - 1 - f)) {

                g--;

            }

            suffixes[i] = f - g;

        }

    }

    return suffixes;

}

This implementation takes two string arguments, a pattern string and a text string, and returns a list of the starting positions of all occurrences of the pattern string in the text string.

The `computeSuffixes` method computes the suffixes of the pattern string using the method described in the good suffix rule. The `boyerMoore` method initializes the bad character table and the good suffix table, and initializes two pointers, `i` and `j`, to the last positions in the pattern string.

The algorithm then compares the character at position `j` in the pattern string to the character at position `i` in the text string. If they match, both pointers are decremented. If `j` reaches the beginning of the pattern string, a match has been found and the index of the match can be added to the list of matches.

If the characters do not match, the algorithm uses the bad character table and the good suffix table to determine the next position to compare in the text string. The algorithm shifts the pattern string to align the next character of the pattern string with the next character of the text string.

The time complexity of the Boyer-Moore algorithm is O(m + n), where m is the length of the pattern string and n is the length of the text string. In practice, the algorithm can be much faster than other string matching algorithms in many cases, particularly when the pattern string is long and the text string is relatively short.

PATTERNS

Most patterns of strings are common or equal to array patterns, with the addition of Prefix/Suffix.

PREFIX/SUFFIX

The Prefix/Suffix pattern is a common pattern used in problems involving strings. It involves precomputing some values based on the prefixes or suffixes of the string, and using those values to perform some operation on the string. This can be useful in solving problems that require finding repeated substrings or patterns in the string.

One example of the Prefix/Suffix pattern is the KMP (Knuth-Morris-Pratt) algorithm for pattern matching. The KMP algorithm uses a precomputed table of values based on the prefixes of the pattern to efficiently search for the pattern in a text string. The table is constructed by finding the length of the longest proper prefix of the pattern that is also a proper suffix of the pattern for each position in the pattern.

Another example of the Prefix/Suffix pattern is the Z algorithm for pattern matching. The Z algorithm also uses a precomputed table of values based on the prefixes of the pattern to efficiently search for the pattern in a text string. The table is constructed by finding the length of the longest common prefix between the pattern and each of its suffixes.

In both cases, the precomputed tables allow for efficient pattern matching by avoiding unnecessary comparisons and backtracking. By precomputing values based on the prefixes or suffixes of the string, we can more efficiently perform some operation on the string.

The Prefix/Suffix pattern can be used in a variety of problems involving strings, such as finding the longest repeated substring or pattern in a string, or searching for a pattern in a text string. By recognizing this pattern and applying it appropriately, we can more easily solve these types of problems.

PROBLEMS

Easy

  1. Reverse String
  2. Valid Palindrome

Medium

  1. Longest Substring Without Repeating Characters
  2. Valid Parentheses
  3. Longest Palindromic Substring
  4. Group Anagrams
  5. ZigZag Conversion

Hard

  1. Longest Common Subsequence
  2. Regular Expression Matching

SEE ALSO