Split the dataset columns using Regex

Modified on Thu, 21 Jan 2021 at 01:13 PM

Scenario:

The data contains Address column that has Name, City, Email, Contact Number, etc. separated with special characters. 

Requirement:

Split the column using Regex to generate different columns for Name, City, Contact Number, etc.


Sample Data


In the below given dataset, the Address column values contain Name, City, State, etc. in a single line with different separators such as ‘,’, ‘;’, ‘-’, etc. 

For example, values like “Name: Mike Pelletier, City- Laguna Niguel State; California Contact No: +1 9410004015”.

The different parts of this value are;

Name: Mike Pelletier,

City- Laguna Niguel

State; California

Contact No: +1 9410004015

The separators used to split the values are “:, ,, -, :”.



 

Split the Name from the Address column 


Name is given with different separators like {;,:,-,,} as shown below. We will be splitting the name and creating a column with a single regex expression here. 


1. To split the Name column, right click on the Address column - >Split ->Split to Column

Split Column panel will be displayed as below. 


                                                                                            SPLIT COLUMN PANEL


2. Select Regex optionGive the Expression as “[\,\;\-]”. Provide new column name and Click APPLY.

The best fit expression that is used to fulfil our requirement is [\;\,\-].

\; - This will match “;” character.

\, - This will match “,” character.

\- - This will match “-” character. 

The result will be split to a new Name column from where the regex expression has matched. As per the regex expression given, the first matching character was a Comma (,) which occurred after the Name field in the string. So the name column was extracted from the whole line;

Name: Mike Pelletier.


                                                                        RESULT WITH COLUMN NAME VALUES


To remove the characters “Name:” from the column value of NAME_H, use the same regex expression (“[\,\;\-]”) again to split. 

The final result will have extracted name values as below. Do the same steps for extracting other columns like City, State, Contact number, etc. 

 

                                                                        RESULT WITH NAME VALUES EXTRACTED

Similarly for extracting Contact number follow the below steps.

Split the Contact number from the Address column


                                                    ADDRESS COLUMN HAVING CONTACT NO VALUES


Address values given are like “Name: Mike Pelletier, City- Laguna Niguel State; California Contact No: +1 9410004015”.

We will be splitting the contact number (+1 9410004015) and creating a column with a single regex expression - [+][0-9]{1} ?[0-9].


                                                        SPLITTING BY REGEX

The result would split the contact number to a different column.


                                                                DATASET AFTER SPLITTING MOBILE NUMBER COLUMN


Basics of Regex expression engine and examples: 


Basically a regex-directed engine processes regular expressions, trying to match the pattern to the given string.  If a match is found, it advances through the regex and the subject string. If the characters fail to match, it picks up the next position in the regex and the subject string where it can try a different path through the regex.

Some examples are given as below.


String example

Regex Expression

Result

Logic

Name:  Mike Pelletier, City- Laguna Niguel State; California Contact No: + 9410004015

[\,\;\-]

(From Left)

Name:  Mike Pelletier

The string will be split from the Comma (,) as per the regex rule.

Name:  Mike Pelletier, City- Laguna Niguel State; California Contact No: + 9410004015

[+][0-9]{1} ?[0-9]

(From Right)

9410004015

Matches zero or more characters from the set 0123456789

Ted@gmail.com,  City: Laguna Niguel, Contact No: +91 9410000157

[A-Za-z0-9]{1,}@[a-zA-Z]{5,}.[a-zA-Z]{2,}

Ted@gmail.com

1. Matches one character from the English alphabet (both cases), digits, “+”, “_”, “.” and, “-” ,

2. @ symbol, 

3. Matches one character from the English alphabet (both cases), digits, “.” and “–” after the @ symbol.

 




Note: This article is based on Smarten Version 5.0. This may or may not be relevant to the Smarten version you may be using.


Was this article helpful?

That’s Great!

Thank you for your feedback

Sorry! We couldn't be helpful

Thank you for your feedback

Let us know how can we improve this article!

Select atleast one of the reasons

Feedback sent

We appreciate your effort and will try to fix the article