Data Source Configuration

Upriver allows you to add and monitor any data source with just a few clicks.

Configuration Attributes

Each data source has the following attributes that make up its configuration.

Full data source configuration page

Basic Configuration

Name

This is the name under which the data source is displayed in the data set catalog and on any other pages (such as incidents and notifications).

This name can be changed in order to provide a name that is more meaningful than the name given to the physical storage.

Owners

You can assign users to a data source by selecting from any of the users who have access to the platform. This feature enhances collaboration by clearly identifying who is responsible for each data source.

When a user is assigned as the owner, they will be tagged and notified whenever incidents occur related to their data source, ensuring timely attention and resolution.

Type

Select the data source type from the supported types. For each data source type, a different sub-menu will appear with the relevant details for the data source.

S3

Bucket - The name of the given bucket or the path to it (e.g. s3://bucket-name).

Bucket Prefix - The prefix inside the bucket which needs to be monitored.

Region - The region where the bucket is located.

Format - The format of the data in the given prefix (JSON, Parquet, CSV, or Iceberg).
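As an illustration of how the Bucket and Bucket Prefix fields relate, here is a minimal sketch (not part of Upriver) that splits an s3:// path into its bucket and prefix components; the helper name and the example path are assumptions for the example:

```python
from urllib.parse import urlparse

def split_s3_path(path: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, prefix).

    A bare bucket name (no scheme) is returned with an empty prefix.
    """
    if path.startswith("s3://"):
        parsed = urlparse(path)
        return parsed.netloc, parsed.path.lstrip("/")
    return path, ""

bucket, prefix = split_s3_path("s3://bucket-name/events/2024/")
```

Either form can then be entered in the configuration: the bucket name in Bucket and the path under it in Bucket Prefix.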

BigQuery

Integration - Choose the integration configured (find instructions in BigQuery).

Dataset - Name of the BigQuery dataset.

Table - Name of the table.

For example, given the table upriver-proj.product_analytics.events, the integration would be the one set on upriver-proj, the dataset would be product_analytics, and the table would be events.
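The split described above can be sketched in a few lines; the helper name is hypothetical and not part of Upriver:

```python
def split_bigquery_table(fq_name: str) -> dict:
    # A fully-qualified BigQuery table name has the form
    # project.dataset.table.
    project, dataset, table = fq_name.split(".")
    return {"project": project, "dataset": dataset, "table": table}

parts = split_bigquery_table("upriver-proj.product_analytics.events")
```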

Kinesis

Stream Name - The name of the stream containing the data.

Region - The region where the stream is located.

GCS

Bucket - The name of the given bucket or the path to it (e.g. gs://bucket-name).

Bucket Prefix - The prefix inside the bucket which needs to be monitored.

Region - The region where the bucket is located.

Format - The format of the data in the given prefix (JSON, Parquet, CSV, or Iceberg).

Snowflake

Database - The name of the Snowflake database.

Schema - The schema within the database.

Table - The name of the table to be monitored.

Confluent Kafka

Integration - Choose the integration configured (find instructions in Confluent Kafka).

Topic - Stream topic.

Group ID - The group ID of the consumer.

Redshift

Integration - Choose the integration configured (find instructions in Redshift).

Database - The database the table is in.

Schema - The schema the table is in.

Table - The table to monitor.

Advanced Configuration

Sampling Rate

The percentage of the data that will be used to create the profile. Users can either set a fixed fraction of the data (a value between 0 and 1) or select dynamic sampling.

If the user chooses Dynamic Sampling, Upriver's algorithms decide the sampling rate based on the volume of data and the distribution it observes.
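Fixed-rate sampling of this kind can be illustrated with a short sketch; the function name and seed handling are assumptions for the example, not Upriver's algorithm:

```python
import random

def sample_rows(rows, rate: float, seed: int = 0):
    """Keep each row with probability `rate` (a value between 0 and 1).

    A fixed seed makes the selection reproducible across runs.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]
```

A rate of 1.0 keeps every row; a rate of 0.1 profiles roughly ten percent of the data.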

Update Time

The interval at which the data analysis process will run. Users can either select one of the available hours from the drop-down, or select dynamic update.

If the user chooses Dynamic Update, Upriver's algorithms decide the relevant interval by analyzing the update patterns seen for the data source.

Pivot Fields

The name of the columns used to create the different "pivots". See Pivot Fields.

After entering each value, make sure to click Add or press Enter on your keyboard.

Currently only up to two different pivot fields can be defined.
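To illustrate what pivot fields do conceptually, here is a hedged sketch (not Upriver's code) that groups rows into separate pivots by up to two pivot columns; all names are illustrative:

```python
from collections import defaultdict

def build_pivots(rows, pivot_fields):
    """Group rows into separate "pivots", one per combination of
    values of the chosen pivot columns (currently at most two)."""
    assert len(pivot_fields) <= 2, "at most two pivot fields"
    pivots = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in pivot_fields)
        pivots[key].append(row)
    return dict(pivots)
```

Each resulting group would then be monitored as its own slice of the data.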

Filter

A filter can be used to monitor only the parts of the data where a certain condition holds true. To use the filter, enter the name of the field/column to filter on and the desired value.

Currently only a single value filter is available. Support for multiple filters is under development.
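A single-value equality filter of this kind can be sketched as follows; the function is illustrative, not Upriver's implementation:

```python
def apply_filter(rows, field, value):
    # Keep only rows where the given field equals the desired value
    # (a single-value filter, matching the current capability).
    return [row for row in rows if row.get(field) == value]
```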

Cardinality Threshold

The number of distinct values used to determine whether a field is categorical.

Fields with fewer distinct values than the set threshold will be considered categorical.
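The threshold rule can be expressed directly; this sketch mirrors the description above and is not Upriver's code:

```python
def is_categorical(values, threshold: int) -> bool:
    # A field is categorical when its number of distinct values is
    # below the configured cardinality threshold.
    return len(set(values)) < threshold
```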

Default Incident Severity

The user can set the default severity that will be assigned to incidents related to the data source. See Managing Incidents.

Staleness Threshold

The maximum number of days a data source can go without receiving data before it is considered "stale", i.e. no longer being updated.
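Conceptually, the staleness check compares the time since data was last seen against the threshold; a minimal sketch with hypothetical names:

```python
from datetime import datetime, timedelta

def is_stale(last_seen: datetime, threshold_days: int, now: datetime) -> bool:
    # A data source is stale when no data has been seen for more
    # days than the configured staleness threshold.
    return now - last_seen > timedelta(days=threshold_days)
```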

Timestamp Column

Specifies the name of the column to be used as the timestamp for each row in the table.

External ID

A user-defined identifier that represents the ID used in the user's own systems.

Webhook

A URL where incidents will be sent. Enables integration with external systems by forwarding incident data in real time.
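To show the shape of such an integration, here is a hedged sketch that serializes an incident as the JSON body of a webhook POST; the field names are assumptions for the example, not Upriver's actual payload schema:

```python
import json

def build_incident_payload(incident: dict) -> bytes:
    """Serialize an incident as the JSON body of a webhook POST.

    The field names below are illustrative only.
    """
    body = {
        "incident_id": incident["id"],
        "data_source": incident["data_source"],
        "severity": incident.get("severity", "medium"),
    }
    return json.dumps(body).encode("utf-8")

payload = build_incident_payload({"id": "inc-1", "data_source": "orders"})
```

An HTTP client (e.g. urllib.request) would then POST this body, with a Content-Type: application/json header, to the configured webhook URL.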

Nullify Empty Strings (Checkbox)

If enabled, empty strings ("") will be treated as empty fields when calculating the completeness of a field.

Nullify Empty Arrays (Checkbox)

If enabled, empty arrays ([]) will be treated as empty fields when calculating the completeness of a field.
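The effect of these two checkboxes on completeness can be illustrated with a sketch; the function, its defaults, and the treatment of an empty field are assumptions for the example:

```python
def completeness(values, nullify_empty_strings=False, nullify_empty_arrays=False):
    """Fraction of non-empty values in a field. With the nullify
    options enabled, "" and [] count as empty, like the checkboxes."""
    def is_empty(v):
        if v is None:
            return True
        if nullify_empty_strings and v == "":
            return True
        if nullify_empty_arrays and v == []:
            return True
        return False

    if not values:
        return 0.0  # assumption: a field with no values is 0% complete
    return sum(not is_empty(v) for v in values) / len(values)
```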
