Upriver supports monitoring data sources in GCP in both SaaS and hybrid deployment models.
This page explains how to set up your environment for each model. In either case, Upriver will provide you with a Terraform script that sets up your environment for working with Upriver.
To start setting up your environment, please contact an Upriver representative for a service account and an up-to-date Terraform script.
SaaS
In a SaaS deployment, a data viewer service account is created in a project inside your Google account and is then granted access to any data sources you wish to monitor. A service account managed by Upriver is then granted the roles/iam.serviceAccountTokenCreator role on the data viewer service account, allowing Upriver to impersonate the data viewer and access the data.
This process is automated using a Terraform script provided by Upriver.
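For illustration, the impersonation grant performed by the script is roughly equivalent to the following sketch (the data_viewer resource name is hypothetical; var.upriver_service_account is the variable described below):

resource "google_service_account_iam_member" "upriver_impersonation" {
  # Allow Upriver's managed service account to mint tokens for (impersonate)
  # the data viewer service account.
  service_account_id = google_service_account.data_viewer.name
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "serviceAccount:${var.upriver_service_account}"
}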
In order to grant a service account from Upriver's Google account permissions on the data viewer service account, you may need to update the "Domain restricted sharing" organization policy and add Upriver's Google customer ID to your allowed customer ID list.
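If you manage organization policies with Terraform, the change might look roughly like the sketch below (the organization ID and customer IDs are placeholders; Upriver's actual customer ID comes from your representative):

resource "google_organization_policy" "domain_restricted_sharing" {
  org_id     = "123456789012"                  # your organization ID
  constraint = "iam.allowedPolicyMemberDomains"

  list_policy {
    allow {
      values = [
        "C0xxxxxxx",  # your own Google customer ID
        "C0yyyyyyy",  # Upriver's Google customer ID (placeholder)
      ]
    }
  }
}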
After running the Terraform you'll need to provide its outputs to an Upriver representative, who will do the final setup for your account.
To set up the Terraform, you need to define its variables with the following values (an example terraform.tfvars is provided after the variable list):
upriver_service_account - The service account provided to you by an Upriver representative.
service_account_project - The project id of the project that the data viewer service account should be created in.
prefix - The prefix to use for the service account name.
monitored_projects - The project ids of the GCP projects that the service account should have access to. Every project you want Upriver to monitor must be listed here.
is_bigquery_account - Controls whether the service account will have access to BigQuery or not. If set to false, other variables related to BigQuery access will be ignored.
should_access_entire_bigquery - If true, the service account will be granted the "roles/bigquery.dataViewer" role on the project. This role allows the service account to access any table within the project's BigQuery. If you would rather allow access to specific datasets/tables, set this to false and use bigquery_dataset_ids or bigquery_table_ids instead.
bigquery_dataset_ids - Specific datasets that the service account should have access to. Access is provided using the "roles/bigquery.dataViewer" role.
format:
{ dataset_id=<dataset_id>, project=<project_id> }
bigquery_table_ids - Specific tables that the service account should have access to. Access is provided using the "roles/bigquery.dataViewer" role. This access is provided using google_bigquery_table_iam_member, which can cause issues if you are using other table access methods. If you're using other methods in your terraform, you should use should_access_entire_bigquery or give permissions to the service account yourself. This can be used alongside bigquery_dataset_ids to give access to whole datasets at once, and also specific tables from other datasets. Note: the dataset and table ids should not include the project/dataset id as a prefix.
format:
{ table_id=<table_id>, dataset_id=<dataset_id>, project=<project_id>}
should_access_entire_storage - If true, the service account will be granted the "roles/storage.objectViewer" role on the project. This role allows the service account to access any bucket within the project. If you would rather allow access to specific buckets, set this to false and use storage_buckets instead.
storage_buckets - Specific buckets that the service account should have access to. Access is provided using the "roles/storage.objectViewer" role. This access is provided using google_storage_bucket_iam_member, which can cause issues if you are using other bucket access methods. If you're using other methods in your terraform, you should use should_access_entire_storage or give permissions to the service account yourself.
storage_folders - Specific folders in a bucket that the service account should have access to. Access is provided using the "roles/storage.objectViewer" role. The folder must be a managed folder to allow access management on the folder level. We won't turn the folder into a managed folder as part of this module, so that removing the folder from this module won't remove its managed status. This is done to avoid harming other services that rely on permissions for that specific folder. This access is provided using google_storage_managed_folder_iam_member, which can cause issues if you are using other bucket access methods. If you're using other methods in your terraform, you should use should_access_entire_storage or give permissions to the service account yourself.
format:
{ bucket_name=<bucket_name>, path_to_folder=<path_to_folder> }
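For illustration, a terraform.tfvars for the SaaS setup might look like the following sketch. All values are placeholders (hypothetical project IDs, datasets, and bucket names); the exact variable set is defined by the script you receive from Upriver:

# The service account email your Upriver representative gives you (placeholder).
upriver_service_account = "monitor@upriver-prod.iam.gserviceaccount.com"
service_account_project = "my-upriver-project"
prefix                  = "upriver"
monitored_projects      = ["analytics-prod", "warehouse-prod"]

# BigQuery: grant access to one whole dataset and one specific table.
is_bigquery_account           = true
should_access_entire_bigquery = false
bigquery_dataset_ids = [
  { dataset_id = "sales", project = "analytics-prod" },
]
bigquery_table_ids = [
  { table_id = "events", dataset_id = "tracking", project = "analytics-prod" },
]

# Storage: grant access to one bucket and one managed folder within it.
should_access_entire_storage = false
storage_buckets = ["raw-data-bucket"]
storage_folders = [
  { bucket_name = "raw-data-bucket", path_to_folder = "landing/daily" },
]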
Hybrid
In a hybrid deployment, Upriver will create a Databricks workspace in one of your projects to run the monitoring from. This process involves the creation of multiple service accounts, GCS buckets, and compute resources, and requires giving an Upriver service account the ability to manage IAM permissions in the given project. For that reason, we recommend creating a dedicated project for Upriver, in which these permissions will be granted. This project will need to have the Compute Engine API enabled, as it is required for the creation of a workspace.
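If the dedicated project is managed with Terraform, enabling the API might look like this sketch (the project ID is a placeholder):

resource "google_project_service" "compute" {
  # Enable the Compute Engine API, required for Databricks workspace creation.
  project = "upriver-databricks-project"
  service = "compute.googleapis.com"

  # Keep the API enabled even if this resource is later destroyed.
  disable_on_destroy = false
}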
In addition, a data viewer service account will be created just as in a SaaS deployment. However, Upriver will only be able to impersonate the service account from the Databricks workspace in your project, and won't have permission to impersonate it otherwise.
After running the Terraform you'll need to provide its outputs to an Upriver representative, who will create the Databricks workspace and finish setting up your account.
To set up the Terraform, you need to define its variables with the following values (an example terraform.tfvars is provided after the variable list):
upriver_databricks_service_account - The service account that Databricks will use to create and manage its resources. This service account has the permissions required by Databricks and will be provided to you by an Upriver representative.
databricks_project - The project that the Databricks workspace should be created in. We recommend creating a project specifically for the Databricks workspace, since Databricks requires permissions to create its own service account and grant it permissions.
prefix - The prefix to use for the service account name.
monitored_projects - The project ids of the GCP projects that the service account should have access to. Every project you want Upriver to monitor must be listed here.
is_bigquery_account - Controls whether the service account will have access to BigQuery or not. If set to false, other variables related to BigQuery access will be ignored.
should_access_entire_bigquery - If true, the service account will be granted the "roles/bigquery.dataViewer" role on the project. This role allows the service account to access any table within the project's BigQuery. If you would rather allow access to specific datasets/tables, set this to false and use bigquery_dataset_ids or bigquery_table_ids instead.
bigquery_dataset_ids - Specific datasets that the service account should have access to. Access is provided using the "roles/bigquery.dataViewer" role.
format:
{ dataset_id=<dataset_id>, project=<project_id> }
bigquery_table_ids - Specific tables that the service account should have access to. Access is provided using the "roles/bigquery.dataViewer" role. This access is provided using google_bigquery_table_iam_member, which can cause issues if you are using other table access methods. If you're using other methods in your terraform, you should use should_access_entire_bigquery or give permissions to the service account yourself. This can be used alongside bigquery_dataset_ids to give access to whole datasets at once, and also specific tables from other datasets. Note: the dataset and table ids should not include the project/dataset id as a prefix.
format:
{ table_id=<table_id>, dataset_id=<dataset_id>, project=<project_id>}
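For illustration, a terraform.tfvars for the hybrid setup might look like the following sketch (all values are placeholders; the actual service account comes from your Upriver representative):

upriver_databricks_service_account = "databricks@upriver-prod.iam.gserviceaccount.com"
databricks_project                 = "upriver-databricks-project"
prefix                             = "upriver"
monitored_projects                 = ["analytics-prod"]

# Grant project-wide BigQuery read access instead of listing datasets/tables.
is_bigquery_account           = true
should_access_entire_bigquery = true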
Enabling access management in GCS
In order to use Upriver only on specific folders inside a GCS bucket, access management needs to be enabled on the relevant bucket. Below is an explanation of how to enable this from within the GCP console.
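If the bucket is instead managed with Terraform, a minimal sketch could look like the following (bucket and folder names are hypothetical; this assumes folder-level access management means uniform bucket-level access plus a managed folder):

resource "google_storage_bucket" "monitored" {
  name     = "raw-data-bucket"
  location = "US"

  # Uniform bucket-level access is a prerequisite for managed folders.
  uniform_bucket_level_access = true
}

resource "google_storage_managed_folder" "landing_daily" {
  bucket = google_storage_bucket.monitored.name
  name   = "landing/daily/"  # managed folder paths must end with a slash
}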