Split the dataset columns using Regex

Modified on Mon, 17 Feb at 11:59 AM

Scenario:

The data contains Address column that has Name, City, Email, Contact Number, etc. separated with special characters.

Requirement:

Split the column using Regex to generate different columns for Name, City, Contact Number, etc.

Sample Data

In the below given dataset, the Address column values contain Name, City, State, etc. in a single line with different separators such as ‘,’, ‘;’, ‘-’, etc.

For example, values like “Name: Mike Pelletier, City- Laguna Niguel State; California Contact No: +1 9410004015”.

The different parts of this value are;

Name: Mike Pelletier,

City- Laguna Niguel

State; California

Contact No: +1 9410004015

The separators used to split the values are “:, ,, -, :”.

Split the Name from the Address column

Name is given with different separators like {;,:,-,,} as shown below. We will be splitting the name and creating a column with a single regex expression here.

1. To split the Name column, right click on the Address column - >Split ->Split to Column

Split Column panel will be displayed as below.

SPLIT COLUMN PANEL

2. Select Regex option. Give the Expression as “[\,\;\-]”. Provide new column name and Click APPLY.

The best fit expression that is used to fulfil our requirement is [\;\,\-].

\; - This will match “;” character.

\, - This will match “,” character.

\- - This will match “-” character.

The result will be split to a new Name column from where the regex expression has matched. As per the regex expression given, the first matching character was a Comma (,) which occurred after the Name field in the string. So the name column was extracted from the whole line;

Name: Mike Pelletier.

RESULT WITH COLUMN NAME VALUES

To remove the characters “Name:” from the column value of NAME_H, use the same regex expression (“[\,\;\-]”) again to split.

The final result will have extracted name values as below. Do the same steps for extracting other columns like City, State, Contact number, etc.

RESULT WITH NAME VALUES EXTRACTED

Similarly for extracting Contact number follow the below steps.

Split the Contact number from the Address column

ADDRESS COLUMN HAVING CONTACT NO VALUES

Address values given are like “Name: Mike Pelletier, City- Laguna Niguel State; California Contact No: +1 9410004015”.

We will be splitting the contact number (+1 9410004015) and creating a column with a single regex expression - [+][0-9]{1} ?[0-9].

SPLITTING BY REGEX

The result would split the contact number to a different column.

DATASET AFTER SPLITTING MOBILE NUMBER COLUMN

Basics of Regex expression engine and examples:

Basically a regex-directed engine processes regular expressions, trying to match the pattern to the given string. If a match is found, it advances through the regex and the subject string. If the characters fail to match, it picks up the next position in the regex and the subject string where it can try a different path through the regex.

Some examples are given as below.

String example	Regex Expression	Result	Logic
Name: Mike Pelletier, City- Laguna Niguel State; California Contact No: + 9410004015	[\,\;\-] (From Left)	Name: Mike Pelletier	The string will be split from the Comma (,) as per the regex rule.
Name: Mike Pelletier, City- Laguna Niguel State; California Contact No: + 9410004015	[+][0-9]{1} ?[0-9] (From Right)	9410004015	Matches zero or more characters from the set 0123456789
Ted@gmail.com, City: Laguna Niguel, Contact No: +91 9410000157	[A-Za-z0-9]{1,}@[a-zA-Z]{5,}.[a-zA-Z]{2,}	Ted@gmail.com	1. Matches one character from the English alphabet (both cases), digits, “+”, “_”, “.” and, “-” , 2. @ symbol, 3. Matches one character from the English alphabet (both cases), digits, “.” and “–” after the @ symbol.

Note: This article is based on Smarten Version 5.0. This may or may not be relevant to the Smarten version you may be using.

Tags: Regex, Dataset columns, Split Column, Address Column, Special Characters, Regex Expression, Column Extraction, Name Extraction, City Extraction, State Extraction, Contact Number Extraction, Regex Matching, Split by Regex, Regex Engine, Character Matching, Pattern Matching, Comma Separator, Semicolon Separator, Hyphen Separator, String Processing, Regular Expression Syntax, Data transformation, Email Extraction, Contact No Pattern, Regex Examples, Data Splitting, Expression Logic, Dataset Manipulation, Phone Number Regex, Address Parsing, Data Splitting Logic, Data Column Naming,