Finding all the indexes of a whole word in a given string using java


Intent

To find all the indexes of a whole word in a given searchable string.

Motivation

Today, I had to write a piece of code in which I had to find all the indexes of a particular keyword in a searchable string. Most of the times, when we have to find index of a keyword in a searchable string we use indexOf method of String class.

For example,


String searchableString = “Don’t be evil. Being evil is bad”;

String keyword = “be”;

So, we can find the index of keyword as


int index = searchableString.indexOf(keyword);

This will give us the index of first occurrence of keyword (“be”). Now suppose, we have to find all the indexes of keyword (“be”), we will have to loop over searchableString and find all the indexes of keyword “be”.

This can be done as follows

	public static void findIndexes(){
		String searchableString = "don’t be evil.being evil is bad";
		String keyword = "be";

		int index = searchableString.indexOf(keyword);
		while (index >=0){
			System.out.println("Index : "+index);
			index = searchableString.indexOf(keyword, index+keyword.length())	;
		}

	}

	public static void main(String[] args) {
		findIndexes();
	}

This program will print :

Index :  6

Index : 14

There is a problem in this code as we should not get “Index : 14” because we are looking for a keyword “be” and the index it has found is from word “being” . This just looks for any occurrence of keyword in the string it does not have to be a whole word. To solve this problem, we will be writing a solution using Regular expressions.

Solution

We have to find out the indexes which correspond to the whole word. For example, we should only get “Index : 6” as that corresponds to the keyword “be” not “Index : 14″ as it corresponds to the word “being” .

This can be done as follows :-

WholeWordIndexFinder.java

package org.dailywtf.string;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeWordIndexFinder {

	private String searchString;

	public WholeWordIndexFinder(String searchString) {
		this.searchString = searchString;
	}

	public List<IndexWrapper> findIndexesForKeyword(String keyword) {
		String regex = "\\b"+keyword+"\\b";
		Pattern pattern = Pattern.compile(regex);
		Matcher matcher = pattern.matcher(searchString);

		List<IndexWrapper> wrappers = new ArrayList<IndexWrapper>();

		while(matcher.find() == true){
			int end = matcher.end();
			int start = matcher.start();
			IndexWrapper wrapper = new IndexWrapper(start, end);
			wrappers.add(wrapper);
		}
		return wrappers;
	}

	public static void main(String[] args) {
		WholeWordIndexFinder finder = new WholeWordIndexFinder("don’t be evil.being evil is bad");
		List<IndexWrapper> indexes = finder.findIndexesForKeyword("be");
		System.out.println("Indexes found "+indexes.size() +" keyword found at index : "+indexes.get(0).getStart());
	}

}

IndexWrapper.java

package org.dailywtf.string;

public class IndexWrapper {

	private int start;
	private int end;

	public IndexWrapper(int start, int end) {
		this.start = start;
		this.end = end;
	}

	public int getEnd() {
		return end;
	}

	public int getStart() {
		return start;
	}

	@Override
	public int hashCode() {
		final int prime = 31;
		int result = 1;
		result = prime * result + end;
		result = prime * result + start;
		return result;
	}

	@Override
	public boolean equals(Object obj) {
		if (this == obj)
			return true;
		if (obj == null)
			return false;
		if (getClass() != obj.getClass())
			return false;
		IndexWrapper other = (IndexWrapper) obj;
		if (end != other.end)
			return false;
		if (start != other.start)
			return false;
		return true;
	}

}

Explanation
First of all a regular expression is built using the keyword. Regex \b defines a word boundary. \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b.
More information on regex could be found at this link.  Next, we just iterate over the matcher till matcher can find any match. Whenever a match is found an wrapper object is created which contains the start and end position of the keyword.

Using the above code, we can find all the indexes of a keyword in a searchable string.

10 Responses to Finding all the indexes of a whole word in a given string using java

  1. Tym the Enchanter says:

    Nice little solution, two things though.

    1. The hashCode for index wrapper only returns the value of the evaluation in line 26

    result = prime * result + start;

    If you want to assign 1 to result then the next two lines should start result += and not result =, but then do you need the line

    int result = 1 anyway?

    2. Maybe adding a flag to do case insensitive search, or to allow the addition of any flags as OR’d ints would be useful.

    Tym

    • whyjava says:

      1. I used eclipse to generate the hashcode and equals method. In my view, the hashcode method is correct because in line 25
      result = prime * result + end;
      it calculates the value of result based on the end value and now this result value is used
      in the line 26
      result = prime * result + start;

      So, result is calculated based on the value of result in line 25.

      2. Yes, we could add a flag for specifying whether we want case sensitive or case insensitive search.

      Thanks
      whyjava

  2. Artur Biesiadowski says:

    If you want to use it often, caching Pattern for given keywords might be thing to consider (unless you are looking for different words each single time).

    If you use it VERY often, write it on your own without using regexp (after profiling that this is really a hotspot in the app). I remember that rewriting some regexp trying to detect date format to hand-crafted logic gave 30 times speedup in synthetic microbenchmark and noticeable speedup in real app.

  3. scalaBoy says:

    In scala:

    val str = "Don't be evil. Being evil is bad"
    val key = "be"
    val regex = ("\\b" + key + "\\b").r
    (for(m <- regex findAllIn str matchData) yield (m.start,m.end)).toList
    

    would get List((6,8))

  4. scalaBoy says:

    In scala, the following codes would get List((6,8))

    val str = “Don’t be evil. Being evil is bad”
    val key = “be”
    val regex = (“\\b” + key + “\\b”).r
    (for(m <- regex findAllIn str matchData) yield (m.start,m.end)).toList

  5. skhanzada says:

    how you have added syntax highlighter in your blog?

  6. Juri says:

    Thank you for the post

    In case keyword contains some regex reserved characters, the keyword should be escaped:
    String regex = “\\b”+ Pattern.quote(keyword) +”\\b”;

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 175 other followers