Combined Data Cleansing

Combined Data Cleansing is located under Model Studio ( ) in Data Preparation, in the left task pane. Use the drag-and-drop method to use the algorithm in the canvas. Click the algorithm to view and select different properties for analysis.

Refer to Properties of Combined Data Cleansing.

Properties of Combined Data Cleansing

The properties of Combined Data Cleansing are as shown below.

The table below describes different fields present in the properties of Combined Data Cleansing.

Field	Description	Remark
Run	It allows you to run the node.	-
Explore	It allows you to explore the successfully executed node.	-
Vertical Ellipse	The available options are Run till node Run from node Publish as a model Publish code	-
Task Name	It is the name of the task selected on the workbook canvas.	You can click the text field to edit or modify the task's name. Space between words is not allowed in the Task Name.
Data Fields	It allows you to select the columns on which you want to apply the cleansing methods.	You can select multiple columns for data cleansing. You can select a categorical or numerical column for data cleansing.
Cleansing Methods	It allows you to select the cleansing method to apply to the data fields.	Click to open a new window and select the appropriate method. There are several methods, each containing multiple sub-methods for specialized cleansing of data. You can select multiple methods. However, you can select only one sub-method from each method. For more information, read the table below.
Node Configuration	It allows you to select the instance of the AWS server to provide control over the execution of a task in a workbook or workflow.	For more details, refer to Worker Node Configuration.

The table below describes the various data cleansing methods and their corresponding sub-methods.

Method	Sub-Method	Description
Remove Unwanted Characters	Remove leading and trailing whitespaces	Remove any extra whitespace(s) before or after a string, word, or number.
	Remove tabs, line breaks, and duplicate whitespaces	Remove any extra tabs, line breaks, or repeated whitespaces in a string.
	Remove all whitespaces	Remove leading, trailing, in-between or duplicate whitespaces.
	Remove letters	Remove alphabet(s) from a string or a number
	Remove numbers	Remove a number(s) from strings or words
	Remove punctuations	Remove all punctuation marks from strings, words, or numbers
Modify Case	Upper Case	Convert all letters or words in a string to upper case.
	Lower Case	Convert all letters or words in a string to lower case.
	Title Case	Capitalize each word in a phrase or a sentence.
Replace Null Data	Replace with blanks (string columns)	Replace any null string value in a cell with a blank space. It renders an empty cell.
Replace Null Data	Replace with zero (numeric columns)	Replace any null value in a numerical column with zero (0).
Remove Null Data	Remove Null Rows	Remove all rows with a null value in any column
	Remove Null Rows	Only remove rows that have a null value in every column
	Remove Null Columns	Remove all columns with a null value in any row
	Remove Null Columns	Only remove columns that have a null value in every row

Note:

By default, the following values are interpreted as Not a Number (NaN):

'', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.

Example of Combined Data Cleansing

Consider a dataset with multiple irregularities present in various cells.
A snippet of the input data is shown below. The following irregularities are appropriately highlighted as shown.

Number and Strings with Special Characters
Empty Cells
Null Values in Columns
Line Break between Words
Leading and Trailing Whitespaces

We apply various Data Cleansing methods to the above data and explore the results. The following scenarios display the effect of the sub-methods.

Method 1: Remove Unwanted Characters

Sub-Method 1: Remove leading and trailing whitespaces

Applied on Column: Whitespace11

Similarly, you can use the Remove all whitespaces sub-method to remove leading, trailing, in-between, and duplicate whitespaces.

Sub-Method 2: Removing Line Break

Applied On Column: 1Item Type

Sub-Method 3: Removing letters

Applied On Column: Units Sold

Sub-Method 4: Removing numbers

Applied On Column: Country

Sub-Method 5: Removing punctuations

Applied On Column: Country

Method 2: Modify Case

Sub-Method 1: Upper Case

Applied to Column: Sales Channel

Sub-Method 2: Lower Case

Applied to Column: Sales Channel

Sub-Method 3: Title Case

Applied to Column: Whitespace11

Method 3: Replace Null Data

Sub-Method 1: Replace with blanks (string columns)

Applied to Column: Item Type

Sub-Method 2: Replace with zero (numeric columns)

Applied to Column: Order ID

Another snippet of input data shown below contains the following irregularities as appropriately labeled.

Rows/Columns with NaN Value
Empty Cells and Cells with Empty Spaces
Rows and Columns with all NaN and Empty values

We apply the Remove Null Data method to the above data and explore the results. The following scenarios display the effect of the sub-methods.

Method 4A: Remove Null Data (Remove Null Rows)

Sub-Method 1: Remove all rows with a null value in any column

Since all columns contain at least one null value, all the rows are deleted.

Sub-Method 2: Only remove rows that have a null value in every column

The original data contains fifteen rows of data. Out of these, the first two rows contain a null value in every column.
The image below shows that the first two rows are removed. Hence, only thirteen rows remain in the output.

Method 4B: Remove Null Data (Remove Null Columns)

Sub-Method 1: Remove all columns with a null value in any row

Since all rows contain at least one null value, all the columns are deleted.

Sub-Method 2: Only remove columns that have a null value in every row

In the original data, out of the nine columns containing data, four columns, namely NullString, NANSTRING, NullString1, and NANSTRING1 contain null values in every row. They also contain empty cells which are considered NaN values.
The image below shows that the above-mentioned four columns are removed. Hence, only five columns remain in the output.

Related Articles
Combined Data Cleansing
Combined Data Cleansing is located under Model Studio ( ) in Data Preparation, in the left task pane. Use the drag-and-drop method to use the algorithm in the canvas. Click the algorithm to view and select different properties for analysis. Refer to ...
Data Compare
Data Compare is located under Model Studio ( ) in Data Preparation, in the task pane on the left. Use the drag-and-drop method to use the algorithm in the canvas. Click the algorithm to view and select different properties for analysis. In the Data ...
Live Data
Data visualization is the representation of data in the form of pictures, images, graphs, or any other form of visual illustration. In RubiThings, Live Data is visualized in the form of Line Chart, SolidGauge, and Speedometer. To fetch the live data, ...
Data Merge
Data Merge is located under Model Studio ( ) in Data Preparation, in the task pane on the left. Use drag-and-drop method to use the algorithm in the canvas. Click the algorithm to view and select different properties for analysis. Properties of Data ...
Data Merge
Data Merge is located under Model Studio ( ) in Data Preparation, in the task pane on the left. Use drag-and-drop method to use the algorithm in the canvas. Click the algorithm to view and select different properties for analysis. Properties of Data ...

Combined Data Cleansing

Combined Data Cleansing

Properties of Combined Data Cleansing

Example of Combined Data Cleansing

Method 1: Remove Unwanted Characters

Sub-Method 1: Remove leading and trailing whitespaces

Sub-Method 2: Removing Line Break

Sub-Method 3: Removing letters

Sub-Method 4: Removing numbers

Sub-Method 5: Removing punctuations

Method 2: Modify Case

Sub-Method 1: Upper Case

Sub-Method 2: Lower Case

Sub-Method 3: Title Case

Method 3: Replace Null Data

Sub-Method 1: Replace with blanks (string columns)

Sub-Method 2: Replace with zero (numeric columns)

Method 4A: Remove Null Data (Remove Null Rows)

Sub-Method 1: Remove all rows with a null value in any column

Sub-Method 2: Only remove rows that have a null value in every column

Method 4B: Remove Null Data (Remove Null Columns)

Sub-Method 1: Remove all columns with a null value in any row

Sub-Method 2: Only remove columns that have a null value in every row

Related Articles

Combined Data Cleansing

Data Compare

Live Data

Data Merge

Data Merge