Data Transformation

This guide covers data transformation in UnDatasIO, including data cleaning, validation, and transformation operations. Learn how to prepare your data for analysis and ensure data quality.

Data transformation is the process of converting, cleaning, and structuring data so that it is suitable for analysis, reporting, or integration with other systems. UnDatasIO provides a comprehensive set of tools for each of these tasks.
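
As a quick preview of the workflow this guide walks through, here is a minimal sketch that chains a cleaning step with a type conversion. It assumes an initialized UnDatasIO client and an in-memory data array, and uses the same calls explained in detail below:

// Clean first, then convert types on the cleaned result
const cleaned = await client.cleanData(data, {
  missingValueStrategy: "remove", // drop rows with any missing value
});
const typed = await client.convertDataTypes(cleaned, {
  conversions: [{ column: "age", type: "integer" }],
});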

Missing values can significantly impact data quality and analysis results. UnDatasIO provides several strategies for handling missing data:

// Remove rows where any column has a missing value
const cleanedData = await client.cleanData(data, {
  missingValueStrategy: "remove",
});

// Remove rows where specific columns have missing values
const cleanedByColumn = await client.cleanData(data, {
  missingValueStrategy: "remove",
  columns: ["email", "phone"],
});

// Fill missing values with default values
const filledData = await client.fillMissingValues(data, {
  strategies: [
    {
      column: "age",
      strategy: "fill",
      value: 0,
    },
    {
      column: "city",
      strategy: "fill",
      value: "Unknown",
    },
    {
      column: "email",
      strategy: "fill",
      value: "no-email@example.com",
    },
  ],
});

// Use statistical methods to fill missing values
const interpolatedData = await client.interpolateMissingValues(data, {
  strategies: [
    {
      column: "price",
      method: "mean", // or 'median', 'mode'
    },
    {
      column: "date",
      method: "forward_fill", // or 'backward_fill'
    },
  ],
});

Remove duplicate records to ensure data integrity:

// Remove exact duplicates
const uniqueData = await client.deduplicateData(data, {
  columns: ["email", "phone"],
  strategy: "keep_first",
});

// Remove fuzzy duplicates (similar records)
const fuzzyUniqueData = await client.deduplicateData(data, {
  columns: ["name", "email"],
  strategy: "fuzzy",
  similarity: 0.8, // 80% similarity threshold
});

Standardize data formats for consistency:

// Standardize text data
const standardizedData = await client.standardizeData(data, {
  operations: [
    {
      column: "name",
      operation: "title_case", // capitalize the first letter of each word
    },
    {
      column: "email",
      operation: "lowercase",
    },
    {
      column: "phone",
      operation: "format",
      format: "(###) ###-####",
    },
  ],
});

Validate data against predefined schemas:

// Define a schema for user data
const userSchema = {
  type: "object",
  required: ["id", "name", "email"],
  properties: {
    id: {
      type: "integer",
      minimum: 1,
    },
    name: {
      type: "string",
      minLength: 1,
      maxLength: 100,
    },
    email: {
      type: "string",
      format: "email",
    },
    age: {
      type: "integer",
      minimum: 0,
      maximum: 150,
    },
    phone: {
      type: "string",
      pattern: "^\\+?[1-9]\\d{1,14}$", // E.164-style phone number
    },
  },
};

// Validate data against the schema
const validationResult = await client.validateData(data, {
  schema: userSchema,
  strict: true,
});

console.log("Validation errors:", validationResult.errors);
console.log("Valid records:", validationResult.validCount);
console.log("Invalid records:", validationResult.invalidCount);

Create custom validation rules for your specific needs:

// Custom validation functions
const customValidations = [
  {
    column: "email",
    rule: (value) => {
      // Guard against missing values so the rule itself cannot throw
      return typeof value === "string" && value.includes("@") && value.includes(".");
    },
    message: "Email must contain @ and .",
  },
  {
    column: "age",
    rule: (value) => {
      return value >= 18 && value <= 65;
    },
    message: "Age must be between 18 and 65",
  },
  {
    column: "salary",
    rule: (value) => {
      return value > 0 && value <= 1000000;
    },
    message: "Salary must be positive and at most 1M",
  },
];

const validatedData = await client.validateData(data, {
  customRules: customValidations,
});

Rename, split, and combine columns to reshape your data:

// Rename columns for better clarity
const renamedData = await client.renameColumns(data, {
  mappings: {
    fname: "first_name",
    lname: "last_name",
    ph: "phone",
    addr: "address",
  },
});

// Split a full name column into first and last name
const splitData = await client.splitColumn(data, {
  column: "full_name",
  delimiter: " ",
  newColumns: ["first_name", "last_name"],
});

// Split address into components
const addressData = await client.splitColumn(data, {
  column: "address",
  delimiter: ",",
  newColumns: ["street", "city", "state", "zip"],
});

// Combine first and last name into a full name
const combinedData = await client.combineColumns(data, {
  columns: ["first_name", "last_name"],
  newColumn: "full_name",
  separator: " ",
});

// Create a formatted address
const formattedData = await client.combineColumns(data, {
  columns: ["street", "city", "state", "zip"],
  newColumn: "formatted_address",
  separator: ", ",
  format: "{street}, {city}, {state} {zip}",
});

Convert data types for proper analysis:

// Convert data types
const convertedData = await client.convertDataTypes(data, {
  conversions: [
    {
      column: "age",
      type: "integer",
    },
    {
      column: "price",
      type: "decimal",
      precision: 2,
    },
    {
      column: "date",
      type: "date",
      format: "YYYY-MM-DD",
    },
    {
      column: "is_active",
      type: "boolean",
      trueValues: ["yes", "true", "1"],
      falseValues: ["no", "false", "0"],
    },
  ],
});

Transform, clean, and extract text data:

// Convert text case
const caseData = await client.transformText(data, {
  operations: [
    {
      column: "name",
      operation: "uppercase",
    },
    {
      column: "description",
      operation: "lowercase",
    },
    {
      column: "title",
      operation: "title_case",
    },
  ],
});

// Clean text data
const cleanedText = await client.cleanText(data, {
  operations: [
    {
      column: "description",
      operations: ["trim", "remove_extra_spaces", "remove_special_chars"],
    },
    {
      column: "comments",
      operations: ["remove_html_tags", "normalize_whitespace"],
    },
  ],
});

// Extract patterns using regex
const extractedData = await client.extractPatterns(data, {
  extractions: [
    {
      column: "phone",
      pattern: "(\\d{3})-(\\d{3})-(\\d{4})",
      newColumns: ["area_code", "prefix", "line_number"],
    },
    {
      column: "email",
      pattern: "([^@]+)@([^@]+)",
      newColumns: ["username", "domain"],
    },
  ],
});

Apply transformations based on conditions:

// Conditional transformations
const conditionalData = await client.applyConditionalTransformations(data, {
  rules: [
    {
      condition: { column: "age", operator: "<", value: 18 },
      transformations: [
        { column: "category", value: "minor" },
        { column: "can_vote", value: false },
      ],
    },
    {
      condition: { column: "salary", operator: ">", value: 100000 },
      transformations: [{ column: "tax_bracket", value: "high" }],
    },
  ],
});

Summarize data with aggregation functions:

// Group and aggregate data
const aggregatedData = await client.aggregateData(data, {
  groupBy: ["department", "location"],
  aggregations: [
    {
      column: "salary",
      function: "average",
      alias: "avg_salary",
    },
    {
      column: "salary",
      function: "sum",
      alias: "total_salary",
    },
    {
      column: "id",
      function: "count",
      alias: "employee_count",
    },
    {
      column: "salary",
      function: "min",
      alias: "min_salary",
    },
    {
      column: "salary",
      function: "max",
      alias: "max_salary",
    },
  ],
});

Create pivot tables for data analysis:

// Create a pivot table
const pivotTable = await client.createPivotTable(data, {
  index: "department",
  columns: "location",
  values: "salary",
  aggfunc: "average",
  fillValue: 0,
});

Assess the quality of your data:

// Generate a quality report
const qualityReport = await client.assessDataQuality(data, {
  metrics: [
    "completeness",
    "accuracy",
    "consistency",
    "timeliness",
    "validity",
  ],
});

console.log("Data Quality Report:");
console.log("Completeness:", qualityReport.completeness);
console.log("Accuracy:", qualityReport.accuracy);
console.log("Consistency:", qualityReport.consistency);
console.log("Timeliness:", qualityReport.timeliness);
console.log("Validity:", qualityReport.validity);

Generate detailed data profiles:

// Profile your data
const profile = await client.profileData(data, {
  include: ["statistics", "distributions", "correlations", "outliers"],
});

console.log("Data Profile:");
console.log("Column Statistics:", profile.statistics);
console.log("Value Distributions:", profile.distributions);
console.log("Column Correlations:", profile.correlations);
console.log("Outlier Detection:", profile.outliers);

Chain multiple transformations together:

// Create a transformation pipeline
const pipeline = client.createPipeline([
  {
    name: "clean_data",
    operation: "clean",
    options: {
      removeDuplicates: true,
      handleMissingValues: "fill",
    },
  },
  {
    name: "standardize_text",
    operation: "standardize",
    options: {
      columns: ["name", "email"],
      case: "lowercase",
    },
  },
  {
    name: "validate_data",
    operation: "validate",
    options: {
      schema: userSchema,
    },
  },
  {
    name: "transform_types",
    operation: "convert_types",
    options: {
      conversions: [
        { column: "age", type: "integer" },
        { column: "salary", type: "decimal" },
      ],
    },
  },
]);

// Execute the pipeline
const transformedData = await pipeline.execute(data);

// Save the pipeline for reuse
await pipeline.save("user_data_cleaning");

// Load and use the saved pipeline
const savedPipeline = await client.loadPipeline("user_data_cleaning");
const result = await savedPipeline.execute(newData);

Keep these best practices in mind:

Planning:

  1. Understand your data before transforming it
  2. Document your transformations for reproducibility
  3. Test transformations on sample data first (see the sketch after these lists)
  4. Validate results after each transformation step

Performance:

  1. Use appropriate data types for better performance
  2. Batch transformations when possible
  3. Cache intermediate results for repeated operations
  4. Monitor memory usage for large datasets

Data Quality:

  1. Implement comprehensive validation rules
  2. Handle edge cases and unexpected data
  3. Maintain data lineage for audit purposes
  4. Run regular quality checks on transformed data
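
As a rough illustration of testing on a sample and validating each step, here is a minimal sketch that dry-runs a cleaning step on a small slice and checks the result before processing the full dataset. It reuses the cleanData and validateData calls and the userSchema from earlier sections, and assumes data is an in-memory array of records:

// Dry-run the transformation on a small sample first
const sample = data.slice(0, 100);
const sampleCleaned = await client.cleanData(sample, {
  missingValueStrategy: "remove",
});

// Validate the sample output before committing to the full run
const sampleCheck = await client.validateData(sampleCleaned, {
  schema: userSchema,
});

if (sampleCheck.invalidCount === 0) {
  // The sample passed, so apply the same step to the full dataset
  const fullCleaned = await client.cleanData(data, {
    missingValueStrategy: "remove",
  });
}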

If a transformation runs into trouble, these are common issues and remedies:

Memory Issues:

  • Use streaming for large datasets
  • Process data in chunks (see the sketch below)
  • Optimize data types
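
One way to bound memory use is to chunk the input yourself. A minimal sketch, assuming cleanData returns an array of records and the operation is row-independent (cross-row operations such as deduplication would need a different approach):

// Process a large dataset in fixed-size chunks to bound memory use
const chunkSize = 10000;
const cleanedRows = [];
for (let i = 0; i < data.length; i += chunkSize) {
  const chunk = data.slice(i, i + chunkSize);
  const cleanedChunk = await client.cleanData(chunk, {
    missingValueStrategy: "remove",
  });
  cleanedRows.push(...cleanedChunk);
}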

Performance Issues:

  • Use efficient transformation operations
  • Enable caching where appropriate
  • Consider parallel processing (see the sketch below)
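
Because each client call returns a promise, independent chunks can also be transformed concurrently with Promise.all. A sketch under the same row-independence assumption as the chunking example above:

// Transform independent chunks concurrently
const chunkLength = 10000;
const chunks = [];
for (let i = 0; i < data.length; i += chunkLength) {
  chunks.push(data.slice(i, i + chunkLength));
}
const transformedChunks = await Promise.all(
  chunks.map((chunk) =>
    client.transformText(chunk, {
      operations: [{ column: "name", operation: "lowercase" }],
    })
  )
);
const allTransformed = transformedChunks.flat();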

Data Quality Issues:

  • Implement validation rules
  • Handle missing values appropriately
  • Check for data type consistency

After mastering data transformation:

  1. Explore advanced analytics features
  2. Set up automated pipelines for recurring transformations
  3. Integrate with external systems using APIs
  4. Implement data governance practices
