Preparing Data for Motif

In this section we cover some standard SQL queries for Sharing Data Source Configuration. We will make the following assumptions:

  1. Your data is stored in a Parquet file called input_events.parquet (we'll modify this slightly for some other cases).
  2. We intend to produce a single Parquet file with at least three columns: actor, ts, and name (these are the names Motif expects); a minimal sketch of the final write step follows this list. We prefer Parquet in Motif because it is efficient and, unlike CSV, there is no ambiguity about column types.
  3. We will use DuckDB's SQL dialect, which has some useful features for our purposes. You can use DuckDB from the command-line quite easily: brew install duckdb.
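
Any of the preparation queries below can be wrapped in a copy statement to produce that single file. A minimal sketch (the output file name is just an example):

copy (
  select
    actor,
    ts,
    name
  from "input_events.parquet"
)
to 'motif_input.parquet' with (format 'parquet')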

All-in-one example

This example query combines most of the ideas for Sharing Data Source Configuration into a single statement. For more detail on each, refer to the sections below.

with combined_events as (
  -- remove a dimension that isn't needed
  select * exclude (not_needed_dimension1)
  from "input_events_source1.parquet"
  -- merge two sources using a special union operation
  union all by name
  select
    user_id as actor, -- rename to actor
    event as name, -- rename to name
    epoch_ms("timestamp") as ts, -- convert timestamps
    -- use the JSON extraction operator for a nested column
    properties->>'$.nested_value.app' as app_version
  from
    -- consume from a JSON source
    read_ndjson_auto('input_events_source2.json')
),
user_dimensions as (
  select
    user_id as actor,
    email,
    age,
    signup_date,
    company
  from "user_dimensions.parquet"
)
select *
from combined_events ce
left join user_dimensions ud
  on ce.actor = ud.actor
where
  -- take a 1% sample of actors
  mod(hash(ce.actor), 100) < 1
  -- remove nuisance events
  and name not in ('nuisance_event1', 'nuisance_event2')
  -- filter to a specific time range
  and ts between '2023-01-01 00:00:00' and '2023-01-07 23:59:59'

Reading and writing Parquet files

Pandas has a Parquet reader and writer built-in. You can use df.to_parquet("input_events.parquet").

DuckDB can easily convert other formats to parquet:

copy (
select * from "input_events_stored_as_csv.csv"
)
to "input_events.parquet" with (format 'parquet')

DuckDB can read newline-delimited JSON as well:

copy (
select * from read_ndjson_auto('my_events.json')
)
to "input_events.parquet" with (format 'parquet')

Combining two sources of events

The union all by name operation in DuckDB is very powerful for combining two data sets. It aligns columns that share the same name, while columns that appear in only one source are preserved.

select
  actor,
  ts,
  name,
  source1_dimension,
  overlapping_dimension
from "input_events_source1.parquet"

union all by name

select
  actor,
  ts,
  name,
  source2_dimension,
  overlapping_dimension
from "input_events_source2.parquet"

This query will correctly align the (actor, ts, name, overlapping_dimension) columns, while preserving source1_dimension and source2_dimension, and creating null values where needed.

Sampling actors

Sometimes we would like to sample actors in order to make a data set smaller while preserving its sequential structure. We use a deterministic hash function to retain only a fraction of the original data.

This query will take a 1% sample by dividing actors into 100 blocks and taking only the first block.

select
  actor,
  ts,
  name
from "input_events.parquet"
where
  mod(hash(actor), 100) < 1

You can change the 100 and 1 to create different sampling rates, or add a salt to the hash by concatenating a string.
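
For example, a salted 10% sample might look like the sketch below. The salt string 'my_salt' is arbitrary, and the cast keeps the concatenation valid when actor is not already a string:

select
  actor,
  ts,
  name
from "input_events.parquet"
where
  -- salt the hash so this sample is independent of other hash-based samples
  mod(hash(cast(actor as varchar) || 'my_salt'), 100) < 10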

You can roughly determine the sampling rate needed to hit a certain sample size by counting the average number of events per actor:

select
  -- change this number to be what you'd like
  2000000 as intended_sample_size,
  count(distinct actor) as num_actors,
  count(1) / num_actors as events_per_actor,
  (intended_sample_size / events_per_actor) / num_actors as sampling_rate
from "input_events.parquet"

If sampling_rate comes out to, say, 0.05, keep roughly that fraction of actors with mod(hash(actor), 100) < 5.

Filtering on time

Another method to make a data set smaller is to keep only events from a certain time range. This is straightforward:

select * from "input_events.parquet"
where ts between '2023-01-01' and '2023-01-07'

Filtering out users whose first event was before some time

We may want to remove users whose sequences started before some threshold.

with eligible_users as (
  select actor, min(ts) as first_event
  from "input_events.parquet"
  group by actor
  having first_event > '2023-01-01'
)
select a.*
from "input_events.parquet" a
join eligible_users b on a.actor = b.actor

Removing nuisance events

Some kinds of events are repetitive and do not provide much information. It can make sequences shorter and easier to work with if we remove them. The first step is characterizing the most common events:

select name, count(1) as cnt
from "input_events.parquet"
group by name
order by cnt desc
limit 20

Once you know which events you want to drop, filter them out:

select * from "input_events.parquet"
where name not in ('nuisance_event1', 'nuisance_event2')

Joining user dimensions

It's helpful to have user dimensions in Motif, but often they are not included in the event tables. You can join these onto the events so they will be available within Motif.

In this example, the actor id column does not have the same name in both data sets, so we rename it in the second one:

with events as (
  select * from "input_events.parquet"
),
user_dimensions as (
  select
    user_id as actor,
    * exclude (user_id)
  from "user_dimensions.parquet"
)
select *
from events e
left join user_dimensions ud
  on e.actor = ud.actor

Notice we use a left join, which means that if user dimensions are not available for an actor, the event is kept and those dimension columns are null.

Extracting a nested dimension

Sometimes useful dimension values can be nested. Motif assumes a flat namespace for columns in parquet files.

If you have a JSON column, you can pick out specific fields. First you have to install and load the DuckDB json extension:

INSTALL 'json';
LOAD 'json';

Now if a column is in json format, you can use the extraction operator to pull out fields:

select
json->>'$.properties.user_id' as actor,
json->>'$.name' as name,
ts
from "input_events.parquet"

If your column is a struct, you can use dot access to pull out nested values:

select
struct_column.actor as actor,
struct_column.name as name,
struct_column.ts as ts
from "input_events.parquet"

Protecting sensitive information

If you would like to avoid sharing sensitive information with Motif, you can either omit or obfuscate dimensions. Here are some short snippets that can be helpful:
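
To omit a dimension entirely, DuckDB's exclude syntax (used in the all-in-one example above) works well; the column names below are just placeholders:

select * exclude (email_address, phone_number)
from "input_events.parquet"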

Replacing an identifier with a hashed value is helpful when you have a unique identifier that contains private information (like an email address):

select
cast(hash(email_address) as varchar) as actor,
name,
ts
from "input_events.parquet"

Note that the hash function in DuckDB will work with any column type, so you can use it on integers, strings, etc.

Replacing characters with Xs is helpful when there is a string field whose contents you'd like to hide while still preserving the structure of the input (e.g. a phone number):

select
regexp_replace(phone_number, '\w', 'X', 'g') as obfuscated_phone_number
from "input_events.parquet"

Mapping a categorical variable to an ordinal rank (ordered by frequency) is helpful if you'd like to be able to do a breakdown on a dimension without sharing its raw values, while still keeping a somewhat interpretable identifier:

with frequency as (
  select
    dimension,
    count(1) as cnt
  from "input_events.parquet"
  group by dimension
),
identifier_map as (
  select
    dimension,
    row_number() over (order by cnt) as dimension_id
  from frequency
)
select
  actor,
  ts,
  name,
  dimension_id
from "input_events.parquet" a
join identifier_map b
  on a.dimension = b.dimension