Optimus column operations

This notebook shows some of the column operations available in Optimus.

Create dataframe

Create Columns

Create a column with a constant value

Append one or multiple columns from other dataframes

Create multiple columns with a constant value

Create multiple columns: one with a constant string, one with the value of an existing column, and one with an array

Select columns

Select columns with a Regex

Select all the columns of type string

Rename Column

Rename multiple columns and uppercase all the columns

Convert to lower case

Convert to uppercase

Cast columns

This is an opinionated way to handle column casting. One of the first things every data cleaning process needs to accomplish is defining a data dictionary. Because of that, we prefer to pass a list of (column, type) tuples like this:

df.cols.cast([("words", "str"), ("num", "int"), ("animals", "float"), ("thing", "str")])
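
To make the data dictionary idea concrete without relying on any particular Optimus version, here is a minimal sketch in plain Python. It is not Optimus code: the `cast_columns` helper, the `CASTERS` table, and the `price` column are hypothetical names chosen for illustration, and the rows are plain dictionaries rather than a dataframe.

```python
# Plain-Python illustration of a data dictionary; this is not Optimus code.
# Maps the type names used in the data dictionary to Python constructors.
CASTERS = {"str": str, "int": int, "float": float}


def cast_columns(rows, data_dictionary):
    """Return new rows with each listed column converted to its target type."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        for column, type_name in data_dictionary:
            value = new_row[column]
            # Keep missing values as None instead of failing on them.
            new_row[column] = None if value is None else CASTERS[type_name](value)
        result.append(new_row)
    return result


rows = [
    {"words": "I like fish", "num": "1", "price": "2.5"},
    {"words": "zombies", "num": "2", "price": None},
]
# One (column, type) tuple per column, kept together in a single place.
data_dictionary = [("words", "str"), ("num", "int"), ("price", "float")]
casted = cast_columns(rows, data_dictionary)
print(casted)
```

Keeping every (column, type) pair in one list means the expected schema can be read, reviewed, and reused without digging through separate cast calls.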

Cast a column to string

Cast all columns to string

Move columns

Sorting Columns

Sort in Alphabetical order

Sort in Reverse Alphabetical order

Drop columns

Drop one column

Drop multiple columns

Chaining

The cols and rows accessors organize and encapsulate Optimus methods. This helps when reading the code, because every line is self-explanatory.

The previous transformations were done step by step, but the same result can be achieved by chaining all the operations into a single line of code, as in the cell below. Chaining can be more efficient and scalable because lazy evaluation lets the engine optimize the whole sequence of transformations at once.
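
To illustrate why chaining fits naturally with lazy evaluation, here is a minimal sketch in plain Python. It is a toy model, not Optimus or Spark code: the `LazyTable` class and its `rename`, `upper`, `drop`, and `collect` methods are hypothetical. Each method only records a step and returns a new object, and nothing runs until `collect()` is called.

```python
class LazyTable:
    """Toy table that records operations and runs them only on collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def _with_step(self, step):
        # Each method returns a new LazyTable, which is what makes chaining possible.
        return LazyTable(self._rows, self._steps + (step,))

    def rename(self, old, new):
        return self._with_step(
            lambda row: {(new if key == old else key): value for key, value in row.items()}
        )

    def upper(self, column):
        return self._with_step(lambda row: {**row, column: row[column].upper()})

    def drop(self, column):
        return self._with_step(
            lambda row: {key: value for key, value in row.items() if key != column}
        )

    def collect(self):
        # All recorded steps are applied in a single pass over the rows.
        output = []
        for row in self._rows:
            for step in self._steps:
                row = step(row)
            output.append(row)
        return output


table = LazyTable([{"words": "fish", "num": 1}, {"words": "zombies", "num": 2}])
# Nothing is computed until collect() is called at the end of the chain.
result = table.rename("words", "text").upper("text").drop("num").collect()
print(result)
```

Because the whole chain is known before any work starts, an engine with a query planner has the chance to reorder or merge steps. This toy version simply runs them in order.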

Unnest Columns

With unnest you can convert one column into multiple columns. It can handle strings and arrays.
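
As a rough picture of what unnesting does, here is a minimal sketch in plain Python. It is not the Optimus implementation: the `unnest` helper, its `separator` and `limit` parameters, and the `column_index` naming of the new columns are assumptions made for illustration.

```python
def unnest(rows, column, separator=None, limit=None):
    """Expand a string or list column into column_0, column_1, ... columns."""
    result = []
    for row in rows:
        value = row[column]
        # Strings are split on the separator; lists are used as they are.
        parts = value.split(separator) if isinstance(value, str) else list(value)
        if limit is not None:
            parts = parts[:limit]  # keep only the first `limit` elements
        new_row = dict(row)
        for index, part in enumerate(parts):
            new_row[f"{column}_{index}"] = part
        result.append(new_row)
    return result


rows = [{"date": "2019-01-31", "scores": [7, 8, 9]}]
by_string = unnest(rows, "date", separator="-")  # string split into three parts
first_only = unnest(rows, "scores", limit=1)     # only the first array element
print(by_string[0])
print(first_only[0])
```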

Only get the first element

Unnest array of string

Unnest an array of ints

Split into 3 parts

Impute

Fill missing data

Set values using user defined functions

Set a value only to numeric values in filter

Sometimes there are columns with numeric and string values together.

To handle this, the set.numeric() function can be used to operate on just one of those types.

In the next example we replace every number with the string "new string"

Or you could pass the value directly
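
The idea of touching only the numeric entries of a mixed column can be sketched in plain Python as follows. This is not Optimus code: `is_numeric` and `set_numeric` are hypothetical helpers, and treating numeric-looking strings such as "3" as numbers is an assumption of this sketch. The replacement can be given either as a function or directly as a value.

```python
def is_numeric(value):
    """True for ints and floats, and for strings that parse as a number."""
    if isinstance(value, bool):
        return False  # bool is a subclass of int, but not a number here
    if isinstance(value, (int, float)):
        return True
    if isinstance(value, str):
        try:
            float(value)
            return True
        except ValueError:
            return False
    return False


def set_numeric(values, new_value):
    """Replace numeric entries; new_value may be a constant or a function."""
    result = []
    for value in values:
        if is_numeric(value):
            result.append(new_value(value) if callable(new_value) else new_value)
        else:
            result.append(value)  # non-numeric entries pass through unchanged
    return result


filter_column = ["a", 2, "3", None, "dog"]
with_function = set_numeric(filter_column, lambda value: "new string")
with_constant = set_numeric(filter_column, "new string")
print(with_constant)
```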

Add a numeric value (20 in this case) to two columns

Select rows where filter is an integer

Create an abstract dataframe to filter rows where the value of num is greater than 1

Create a UDF with two arguments and pass it to df.cols.set

Set a value where the values of num and num 2 are both greater than 2

Count Nulls

Count uniques

Unique

Count Zeros

Column Data Types

Replace

Replace "dog" and "cat" in animals by the string "animals"

Replace "dog-tv", "cat", "eagle" and "fish" in the columns "two strings" and "animals" by the string "animals"

Replace "dog" by "dog_1" and "cat" by "cat_1" in animals

Replace "dog" by "pet" in animals

Replace "a", "b" and "c" by "%" in all columns

Replace 3 and 2 by 10 in a numeric column

Replace 3 by 6 and 2 by 12 in a numeric column

Replace as words

Replace using a regular expression

Nest

Merge two columns into a string column

Merge two columns into a vector column

Merge three columns into an array

Histograms

Statistics

Quantile Statistics

Descriptive Statistics

Calculate Median Absolute deviation
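
For reference, the median absolute deviation is the median of the absolute distances between each value and the median of the data. A minimal sketch in plain Python, using only the standard library (the `median_absolute_deviation` helper is a hypothetical name, and an engine working on large data may compute an approximation instead):

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    deviations = [abs(value - center) for value in values]
    return median(deviations)


values = [1, 2, 3, 4, 100]
mad = median_absolute_deviation(values)
print(mad)
```

The outlier 100 barely affects the result, which is why this statistic is often preferred over the standard deviation for skewed data.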

Calculate percentiles

Calculate Mode

String Operations

Calculate the interquartile range

Cleaning and Date Operations

Years between a date and today (No other date is passed)
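
The default-to-today behavior can be sketched in plain Python as follows. This is not the Optimus implementation: `years_between` is a hypothetical helper, and counting whole elapsed years (rather than a fractional value) is an assumption of this sketch.

```python
from datetime import date


def years_between(start, end=None):
    """Whole years elapsed from start to end; end defaults to today."""
    if end is None:
        end = date.today()  # no other date was passed
    years = end.year - start.year
    # Subtract one year if the anniversary has not been reached yet.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


print(years_between(date(2000, 6, 15)))                    # up to today
print(years_between(date(2000, 6, 15), date(2020, 6, 14)))  # explicit end date
```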