A Couple of Substring Functions: substr_count() and substr_by_delim()

A problem that sometimes when writing queries or stored routines is the need to use strings to do basic manipulation of the data. While from a performance perspective it is generally faster to do these manipulations inside the application, for various reasons it may be desirably to keep things inside MySQL.

This post lists two stored functions that can be used for simple manipulation of strings.

substr_count()

This is a port of the PHP function of the same name. It counts the number of times a given substring is encountered in a text. The signature is:

substr_count(
   in_haystack mediumtext,
   in_needle varchar(255),
   in_offset int unsigned,
   in_length int unsigned
) RETURNS int unsigned

The function as it stands here, has the following limitations and behaviours:

  • As the MySQL convention is to use 1 as the offset, so is the MySQL port of substr_count().
  • As MySQL stored functions do not support optional arguments, all arguments must be specified. For in_offset and in_length use NULL or 0 to use the default values.
  • The maximum length supported for the needle is 255 characters.
  • I am using the LOCATE() function as that returns the first occurrence of a substring after a given offset. This will particularly benefit the performance in cases where the needle is only sparsely present in the search string.

The definition of substr_count() is:

DELIMITER //
   CREATE
 FUNCTION substr_count(in_haystack mediumtext, in_needle varchar(255), in_offset int unsigned, in_length int unsigned) RETURNS int unsigned
 LANGUAGE SQL DETERMINISTIC NO SQL SQL SECURITY INVOKER
   BEGIN
      DECLARE v_count, v_haystack_len, v_needle_len, v_offset, v_endpos int unsigned DEFAULT 0;

      SET v_haystack_len = CHAR_LENGTH(in_haystack),
          v_needle_len   = CHAR_LENGTH(in_needle),
          v_offset       = IF(in_offset IS NOT NULL AND in_offset > 0, in_offset, 1),
          v_endpos       = IF(in_length IS NOT NULL AND in_length > 0, v_offset + in_length, v_haystack_len);

      -- The last offset to use with LOCATE is at v_endpos - v_needle_len.
      -- That also means that if v_needlen > v_endpos, the count is trivially 0
      IF (v_endpos > v_needle_len) THEN
         SET v_endpos = v_endpos - v_needle_len;
         WHILE (v_offset < v_endpos) DO
            SET v_offset = LOCATE(in_needle, in_haystack, v_offset);
            IF (v_offset > 0) THEN
               -- v_offset is now the position of the first letter in the needle.
               -- Skip the length of the needle to avoid double counting.
               SET v_count  = v_count  + 1,
                  v_offset = v_offset + v_needle_len;
            ELSE
               -- The needle was not found. Set v_offset = v_endpos to exit the loop.
               SET v_offset = v_endpos;
            END IF;
         END WHILE;
      END IF;

      RETURN v_count;
   END//

DELIMITER ;

An example of how to use substr_count() is:

mysql> SELECT substr_count('a/b/c/d/e', '/', 3, 5);
+--------------------------------------+
| substr_count('a/b/c/d/e', '/', 3, 5) |
+--------------------------------------+
|                                    2 |
+--------------------------------------+
1 row in set (0.00 sec)

substr_by_delim()

The substr_by_delim() function can be used to pick out one element of a delimited string; an example would be to determine the second element in the comma delimited string ‘a,b,c,d,e’:

substr_by_delim(
   in_set mediumtext,
   in_delimiter varchar(255),
   in_pos int
) RETURNS mediumtext

The width of in_delimiter is set to match the in_needle in substr_count() as substr_count() is used to find the number of delimiters in the search text.

DELIMITER //
   CREATE
 FUNCTION substr_by_delim(in_set mediumtext, in_delimiter varchar(255), in_pos int) RETURNS mediumtext
  COMMENT 'Returns the Nth element from a delimited list.'
 LANGUAGE SQL DETERMINISTIC NO SQL SQL SECURITY INVOKER
    BEGIN
      DECLARE v_num_parts int unsigned DEFAULT 0;

      IF (in_pos < 0) THEN
         -- substr_count returns the number of delimiters, add 1 to get the number of parts
         SET v_num_parts = substr_count(in_set, in_delimiter, NULL, NULL) + 1;
         IF (v_num_parts >= ABS(in_pos)) THEN
            -- Add the requested position (which is negative, so is actually a subtraction)
            -- Add 1 as the position is 1 based.
            SET in_pos = v_num_parts + in_pos + 1;
         ELSE
            -- The requested position is out of range, so set in_pos to 0.
            SET in_pos = 0;
         END IF;
      END IF;
      IF (in_pos <= 0 OR in_pos IS NULL OR in_pos > substr_count(in_set, in_delimiter, 0, NULL)+1) THEN
         -- in_pos is not BETWEEN 1 AND #of elements.
         RETURN NULL;
      ELSE
         RETURN SUBSTRING_INDEX(SUBSTRING_INDEX(in_set, in_delimiter, in_pos), in_delimiter, -1);
      END IF;
   END//
DELIMITER ;

Like with the built-in SUBSTRING_INDEX() function, it is supported both to use positive and negative positions where a negative position counts from the end of the search string.

Examples of how to used substr_by_delim() are:

mysql> SELECT substr_by_delim('a,b,c,d,e', ',', 2);
+--------------------------------------+
| substr_by_delim('a,b,c,d,e', ',', 2) |
+--------------------------------------+
| b                                    |
+--------------------------------------+
1 row in set (0.01 sec)

mysql> SELECT substr_by_delim('a,b,c,d,e', ',', -2);
+---------------------------------------+
| substr_by_delim('a,b,c,d,e', ',', -2) |
+---------------------------------------+
| d                                     |
+---------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT substr_by_delim('a||b||c||d||e', '||', 2);
+-------------------------------------------+
| substr_by_delim('a||b||c||d||e', '||', 2) |
+-------------------------------------------+
| b                                         |
+-------------------------------------------+
1 row in set (0.00 sec)

Hope you will find the two functions useful.

6 thoughts on “A Couple of Substring Functions: substr_count() and substr_by_delim()

  1. Shlomi Noach 22 October, 2012 / 16:36

    Hi,

    Are you familiar with common_schema?

    It contains functions like the above and many others; and in fact aims to provide with a general-purpose “class library”, so to speak.

    See split_token() and get_num_tokens() on the text routines.

    • Jesper Krogh 22 October, 2012 / 17:59

      Yes, the common_schema is another option.

  2. strcmp 22 October, 2012 / 16:58

    Why can’t you implement substr_count() without a loop via (CHAR_LENGTH(haystack)-CHAR_LENGTH(REPLACE(haystack,needle,”)))/CHAR_LENGTH(needle)?

    • Jesper Krogh 22 October, 2012 / 18:01

      That will work as well (adding a haystack = SUBSTRING(…) if you want support for offset and length). Testing this with the Performance Schema suggest the performance of the two implementations are fairly similar.

Leave a Reply to strcmp Cancel reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.