Text Searching And Other Interesting Things – Part 1

Recently some personal projects have gotten me interested in a few string and text searching algorithms and how they compare. I’ve so often heard, “what better way to learn, than to teach”, so thats the plan. I’m going to have a short series on some textbook text searching algorithms and an analysis of their performances. This series should be great for beginners wanting to familiarize themselves with some simple text searches and serve to be a decent brush up for a seasoned professional.

Brute Force:

This first post will cover the most basic of text searching algorithms: The Brute Force Search.

We start with two strings, the needle and the haystack. What we are looking for is the position of the needle in the haystack. Brute forcing works by simple looping through each character in the haystack until it finds a match on the first character of the needle, and then looping through an inner loop verifying that each character in the sequence is the same. If the inner loop concludes that it has found a full match, the position of the first character is returned. If it reaches the end of the haystack and no match has been found, the de facto “-1” position is returned.

Lets look at an example in c#:

public static int BruteForce(string needle, string haystack)
{
    //first get both inputs as character arrays
    char[] needleArray = needle.ToCharArray(), haystackArray = haystack.ToCharArray();
    //we loop from the beginning of the haystack to the point in the haystack where the number of characters left
    //is less than the number of characters in the needle.
    for (int haystackIndex = 0; haystackIndex <= haystackArray.Length - needleArray.Length; haystackIndex++)
    {
        //if the character being checked is the same as the first character, we can enter an inner loop that checks the word
        if (haystackArray[haystackIndex] == needleArray[0])
        {
            //Considering the sigle character case
            if (needleArray.Length < 2) {
                return haystackIndex;
            }
            //check each character one by one staring with the second character since we've confirmed the first
            for (int needleIndex = 1; needleIndex < needleArray.Length; needleIndex++)
            {
                //as long as each character matches we check to see if we've reached the end of the needle array, if so, we return
                //because we have a complete match. If there is a mismatch, break out of the loop and move on.
                if (haystackArray[haystackIndex + needleIndex] == needleArray[needleIndex])
                {
                    if (needleIndex == needleArray.Length - 1)
                    {
                        return haystackIndex;
                    }
                    continue;
                }
                break;
            }
        }
    }
    //if no matches are found, return the generic -1 index
    return -1;
}

Analysis:

Like all algorithms, we consider the time complexity to get an estimate of how performant we can expect our operation to be. For a brute force text search, we can expect a worst case complexity of O(nm)  where n is the length of the haystack and m is the length of the needle. What this is saying is, in the worst case, the number of iterations of the search loop will be equal to the length of the needle multiplied by the length of the haystack.

It makes sense that increasing the size of the haystack would make the search take longer but not increasing the size of the needle. Well, this is the primary drawback of the brute force search. In a real world physical needle and haystack, if the needle was bigger, it should be easier to find. The next algorithm we discuss actually exploits the length of the needle to make the search faster, but that is to be discussed next week.

What is promising about brute force searching is that it is very simple to understand and implement, and for small enough haystacks, there is virtually no time difference between it and an algorithm of lower complexity. In fact, using some more advanced algorithms that need to do some preprocessing of the haystack might actually make those operations slower.

 

This is the first of a series of weekly posts on text searching.

-Darrell