Duplicate Records Handling / Deduplication

Overview

A new Deduplication node has been added to the Data Preparation module. It identifies and removes duplicate records by comparing either full rows or a selected set of columns.

Feature Enhancements

Added a Deduplication option under the Data Preparation module.
Duplicates can be removed by comparing entire rows or selected columns.
Multiple columns can be selected for deduplication.
Three duplicate handling strategies are supported: Keep First, Keep Last, and Remove All.
The dataset is validated before execution.
The record count is updated after deduplication runs.
All columns from the predecessor node are retained in the output dataset.
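
The three strategies can be illustrated with a minimal Python sketch. This is not the product's implementation. It assumes the usual meaning of each strategy: Keep First retains the first occurrence of each duplicate group, Keep Last retains the last, and Remove All drops every row that has a duplicate. The function name `deduplicate`, the strategy strings, and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence

Row = Dict[str, object]

# Hypothetical strategy names mirroring Keep First, Keep Last, Remove All.
STRATEGIES = ("keep_first", "keep_last", "remove_all")


def deduplicate(rows: List[Row],
                subset: Optional[Sequence[str]] = None,
                strategy: str = "keep_first") -> List[Row]:
    """Return rows with duplicates handled; all columns are kept as-is."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    if not rows:
        return []

    # No subset means full-row comparison (assumes all rows share columns).
    cols = list(subset) if subset else list(rows[0].keys())

    def key_of(row: Row) -> tuple:
        return tuple(row[c] for c in cols)

    if strategy == "remove_all":
        counts = Counter(key_of(r) for r in rows)
        return [r for r in rows if counts[key_of(r)] == 1]

    # For keep_last, scan from the end, then restore the original order.
    ordered = rows if strategy == "keep_first" else list(reversed(rows))
    seen = set()
    kept = []
    for r in ordered:
        k = key_of(r)
        if k not in seen:
            seen.add(k)
            kept.append(r)
    return kept if strategy == "keep_first" else list(reversed(kept))


sample = [
    {"id": 1, "email": "a@x.com", "city": "Pune"},
    {"id": 2, "email": "b@x.com", "city": "Delhi"},
    {"id": 3, "email": "a@x.com", "city": "Mumbai"},
]

first_ids = [r["id"] for r in deduplicate(sample, ["email"], "keep_first")]
last_ids = [r["id"] for r in deduplicate(sample, ["email"], "keep_last")]
unique_ids = [r["id"] for r in deduplicate(sample, ["email"], "remove_all")]
full_row_ids = [r["id"] for r in deduplicate(sample)]
```

In this sample, rows 1 and 3 share an email, so deduplicating on `email` keeps a different row under each strategy, while full-row comparison keeps all three rows because no two rows are identical.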

Deduplication Module

The Deduplication option is available under Data Preparation.

Deduplication Configuration

Users can configure the columns used to detect duplicates and the strategy used to handle them.
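
As a rough illustration of how such a configuration could be checked before execution, the sketch below models the two settings and validates them against a dataset's columns. The article does not describe what the product's validation actually checks, so the `DedupConfig` class, its fields, and the specific checks are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

# Hypothetical identifiers for Keep First, Keep Last, and Remove All.
VALID_STRATEGIES = ("keep_first", "keep_last", "remove_all")


@dataclass
class DedupConfig:
    # An empty list is taken here to mean "compare full rows".
    columns: List[str] = field(default_factory=list)
    strategy: str = "keep_first"

    def validate(self, dataset_columns: Sequence[str]) -> List[str]:
        """Return a list of problems; an empty list means the config is usable."""
        errors = []
        if not dataset_columns:
            errors.append("dataset has no columns")
        if self.strategy not in VALID_STRATEGIES:
            errors.append(f"unknown strategy: {self.strategy}")
        missing = [c for c in self.columns if c not in dataset_columns]
        if missing:
            errors.append("columns not in dataset: " + ", ".join(missing))
        return errors


ok = DedupConfig(columns=["email"], strategy="keep_last")
bad = DedupConfig(columns=["phone"], strategy="drop")

ok_errors = ok.validate(["id", "email"])
bad_errors = bad.validate(["id", "email"])
```

Returning a list of messages rather than raising on the first problem lets every configuration issue be reported at once.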

Benefits

Improved data quality and consistency.
Flexible duplicate handling options.
Better preprocessing support for analytics workflows.
Simplified duplicate record management.
