SaaS Service Ping Automation

Service Ping Overview

Previously called Usage Ping, Service Ping is a background process that runs weekly in a Denomas Instance and is responsible for collecting, aggregating, and packaging a set of metrics useful for analytics purposes. Metrics are easily extensible, with new metrics changing regularly per the Metric Lifecycle. The full set of metrics are defined in the Metric Dictionary. Once metrics are collected from the instance, they are bundled into a JSON payload (the “ping”) and posted to the Denomas Versions App where they are then synced to downstream processes such as Snowflake. Here is an example of a Service Ping Payload. There are two primary variants of Service Ping: Self-Managed Service Ping and SaaS Service Ping. Self-Managed Service Ping runs versus a single-tenant deployment of Denomas, while SaaS Service Ping runs versus the multi-tenant Denomas.com.

Service Ping Use Cases

Service Ping metrics provide insights that help our Product, Support, and Sales teams understand how Denomas is used. For example, the data helps to:

Support Denomas xMAU KPIs KPI
Calculate Stage Monthly Active Users (SMAU), which helps to measure the success of our stages and features
Understand which features are or are not being used and provide guidance to customers to help them take advantage of Denomas’ rich feature set
Compare counts month over month (or week over week) to get a rough sense for how an instance uses different product features
Collect other facts that help us classify and understand Denomas installations

Self-Managed Service Ping

Self-Managed customers setup and run Service Ping to provide analytics for their own deployments (instances) of Denomas. Customers optionally disable Service Ping in which case metrics will not be sent to Denomas. Customers have access to Service Ping data through the console application.

SaaS Service Ping

Denomas.com (or Denomas SaaS) is essentially a Denomas-hosted multi-tenant version of Denomas. A manually generated version of Service Ping (Manual SaaS Service Ping) has been implemented for SaaS and provides analytics coverage of SaaS equivalent to what we achieve with Self-Managed instances.

In the past this was executed on the Denomas.com environment directly, with major sets of problems:

Performance Problems

The process is error-prone and causes resource contention with live customer activity
The process is slow, and individual metric queries regularly fail
The process is implemented to run manually during off-peak hours
The process requires available staffing to manage end-to-end

Data Coverage Deficiency

In addition, Manual SaaS Service Ping is only capable of generating instance-level (the entire site) data and does not meet all the needs of Sales, Customer Success, and others who need more granular data from the namespace level to measure individual Denomas.com customer adoption.

To solve for these two primary sets of problems, the Data Team developed “Automated SaaS Service Ping”. Automated SaaS Service Ping which is a set of programs that run natively in the Enterprise Data Platform, a system designed for Big Data, Automation, and Scale. With Automated SaaS Service Ping fully operationalized, Manual SaaS Service Ping is decommissioned on 2023-02-19. There are two major sub-programs that make up Automated SaaS Service Ping:

SaaS Instance Service Ping - Automated weekly generation of Service Ping for the Denomas.com instance
SaaS Instance Redis Ping - Automated weekly generation of Service Ping for the Denomas.com instance from Redis metrics ping
SaaS Namespace Service Ping - Automated weekly generation of Service Ping for every Denomas.com instance->namespace

4 Types of Service Ping Processes Run Versus 3 Environments

In total, there are 4 types of Service Ping either in production or development:

Criteria	1. Self-Managed Service Ping	2. ~~Manual SaaS Service Ping~~ (Decomissioned)	3. Automated SaaS Instance Service Ping	4. Automated SaaS Namespace Service Ping
Where Run	Environment 1: A Customer’s Self-Managed Instance	Environment 2: Within Denomas.com Infrastructure	Environment 3: Data Platform Infrastructure	Environment 3: Data Platform Infrastructure
Run By	Denomas (Automatically)	Analytics Instrumentation (Manually)	Airflow (Automatically)	Airflow (Automatically)
Frequency	Weekly	Weekly	Weekly	Weekly
Code Owner	Analytics Instrumentation	Analytics Instrumentation	Data Team	Data Team
Source Code	`Ruby`, `SQL`	`Ruby`, `SQL`	`Python`, `SQL`, `dbt`	`Python`, `SQL`, `dbt`
Data Granularity	Instance	Instance	Instance	Namespace
Status	`Live-Production`	`Decomissioned`	`Live-Production`	`Live-Production`

Automated SaaS Service Ping Implementation

TL;DR:

The Automated SaaS Instance Service Ping process is more stable and faster, with SQL Based metrics taking approx. 30 minutes to complete and Redis Metrics taking 1 minute. However, the lack of change data capture in the Postgres Pipeline (PGP) 1.0 towards Snowflake results in less accurate data, especially with all-time metrics where the errors resulting from a lack of change data capture accumulate overtime.

In the final analysis, the Automated Process provides more stable data, faster; however, at a lower accuracy rate of the metric counts, in particular the all-time metrics.

Process Overview

graph LR
subgraph Automated Service Ping
AUTO[Automated Service Ping]
AUTO --> SQL[Instance SQL metrics]
AUTO --> REDIS[Instance Redis metrics]
AUTO --> NAMESPACE[Instance Namespace Metrics]
end

Automated SaaS Service Ping is a collection of Python programs and dbt processes orchestrated with Airflow and scheduled to run weekly within the Enterprise Data Platform. The Automated SaaS Service Ping Project stores all source code and configuration files. The program relies on two primary data sources: Redis-based counters and SQL-based PostgreSQL tables. Both sources are implemented as automated data pipelines into Snowflake, intended to run independently of the SaaS Service Ping implementation process.

SQL-Based Postgres data from SaaS is synced via pgp and made available in the RAW.SAAS_USAGE_PING schema
Redis data is accessed at program runtime and also stored in the RAW.SAAS_USAGE_PING schema

Automated SaaS Service Ping consists of two major data processing phases.

Phase 1 is gathering and generating the metrics as defined in the Metric Dictionary
Phase 2 is transforming the metrics into Trusted Data Model (FCT and DIM tables) format.

Phase 1: Metrics Gathering and Generation

SaaS Instance Service Ping

SaaS Instance Service Ping runs as described in the Process Overview.

`Instance SQL metrics` data flow

graph TD
subgraph Postgres SQL-sourced Instance Level Metrics
B[1: Gather latest Metrics Queries via API] --> C[2: Transform PG-SQL to Snowflake-SQL]
C --> D[3: Run S-SQL versus Snowflake's Denomas.com clone]
D --> E[4: Preserve SQL metrics results - to combine with Instance Redis Metrics later]
D --> F[5: ON ERROR store error information in Snowflake RAW INSTANCE_SQL_ERRORS]
end

For more details on the data flow, check Service ping README.md file.

`Instance Redis metrics` based data flow

graph LR
subgraph Redis-sourced Instance Level Metrics
B[1: Gather Metrics Values via API] --> C[2: Preserve Redis metrics results - to combine with Instance SQL Metrics later]
end

Combined base data flow

graph LR
subgraph Combining Metrics
A[1.1: SQL metrics result] --- C[ ]:::empty
B[1.2: Redis metrics result] --- C
C --> D[2: Combine results]
D --> E[3: Store in Snowflake RAW COMBINED_METRICS]
classDef empty width:0px,height:0px;
end

`Instance Namespace metrics` based data flow

SaaS Namespace Service Ping produces metrics at a finer-level of granularity than SaaS Instance Service Ping. The process accesses a list of all namespaces in Denomas.com and loops through each namespace to generate ultimate-parent namespace-level usage metrics. The namespaces table provides input to the program and for efficiency, a SQL set operation is used with SQL GROUP BY namespace-id instead of a traditional 1-by-1 namespace loop. Final metrics output is stored at the ultimate parent namespace level. A drawback with Namespace Service Ping is that only SQL-sourced metrics are currently available and Redis-sourced metrics such as analytics_unique_visits.g_analytics_contribution are currently unavailable.

graph TD
subgraph Postgres SQL-sourced Namespace Level Metrics
B[1: Load namespace-level Snowflake-SQL]
B --> C[2: Run Snowflake-SQL versus Snowflake's Denomas.com clone]
C --> D[3: Store metrics results in Snowflake RAW GITLAB_DOTCOM_NAMESPACE]
end

`Instance Namespace metrics` backfill

This action can be done (primarily) by AE and DE can support it as well, if needed. The main idea is to have a simple and flexible way to backfill namespace data in a cost-effective fashion.

In order to backfill a namespace particular metric (or more of them), you need to properly configure NAMESPACE_BACKFILL_VAR in the Airflow Admin -> Variables tab.

The example how variable NAMESPACE_BACKFILL_VAR should look like:

1
2
3
4
5


{
 "start_date": "2022-10-01",
 "end_date": "2022-11-25",
 "metrics_backfill": "['usage_activity_by_stage_monthly.manage.project_imports.git','usage_activity_by_stage_monthly.manage.groups_with_event_streaming_destinations','usage_activity_by_stage_monthly.manage.audit_event_destinations','counts.boards']"
}

Note: Pay attention that the NAMESPACE_BACKFILL_VAR should be in one row, as the Airflow can raise an error if it is multi-line formatted.

Change the metrics definition

In case you want to change the metrics type from database to redis in the metrics definition file, do the following steps:

Got to metrics definition page
Find the metrics you want to change
Press the link, and you will be redirected to the .yaml file
Press the edit/edit in Web IDE button and change data_source: database to data_source: redis
Put a proper comment and check the option to create a new MR, here is an example of how the merge request should look like
Assign your new MR to a PI team member

The entire process can is explained in the recorded sesion

Note: this backfill DAG will not load all-time metrics even if you define it (will be skipped), as it will not produce an accurate result.

The saas_usage_ping_backfill DAG will backfill data for the metrics where the following conditions are applied:

For the defined period - start_date and end_date value from NAMESPACE_BACKFILL_VAR variable
"time_window_query": true
Metrics is in the value metrics_backfill in the variable NAMESPACE_BACKFILL_VAR

Note: the DAG should have status ON. By default, it is in the OFF status, as it is not scheduled. After run, turn off the DAG. Even if it is turned off, no harm, as the DAG will run one time only.

Metrics Gathering and Generation Process Pseudo-code

Assume the Denomas.com Postgres source data pipelines are running and fresh up-to-date data is available in Snowflake in RAW.SAAS_USAGE_PING and PREP.SAAS_USAGE_PING schemas, respectively
Begin Service Ping python program
1. Start SQL-based metrics generation
  1. Start Postgres SQL-sourced Metrics
    1. Grab the latest set of Postgres SQL-sourced (PG-SQL) metric queries from the Metrics Dictionary API Query Endpoint
    2. Transform Instance-Level PG-SQL to Snowflake SQL (S-SQL) using the python transformer
    3. Run S-SQL versus the SaaS Denomas.com clone data available in the Snowflake Data Warehouse and store the results in RAW.SAAS_USAGE_PING.INSTANCE_COMBINED_METRICS. In case of error, data will land in RAW.SAAS_USAGE_PING.INSTANCE_SQL_ERRORS table.
  2. Start Redis-based Metrics
    1. Call the Redis API
    2. Data is picked up and stored in a JSON format, the approximate size is around 2k lines, usually one file per load (at the moment, it is a weekly load) and the data is stored in RAW.SAAS_USAGE_PING.INSTANCE_COMBINED_METRICS
2. Start Namespace-Level metrics generation
  1. Grab the latest metrics queries from the Namespace Queries JSON
  2. Run the Namespace Queries versus the SaaS Denomas.com clone data available in the Snowflake Data Warehouse and store the results in RAW.SAAS_USAGE_PING.GITLAB_DOTCOM_NAMESPACE

Running Automated Service Ping Ad Hoc to Validate Bug Fixes

We may occasionally encounter scenarios where bugs in Service Ping impact recorded metric values. In such cases, the Data Team can expedite completion of the bug fix by running Automated Service Ping on an ad hoc basis. We will only do this when the following conditions are met:

Bugfix has been tested and validated locally by Analytics Instrumentation
All relevant automated tests in the source system have been updated to prevent similar bugs from getting pushed to production in the future
We have maximum confidence that the bugfix works (running Automated Service Ping is for confirmation on production, not for testing)

Backdating Service Pings to Compensate for Bugs

In cases where Service Ping metric values are impacted by a bug, it is possible to duplicate and backdate a correct Service Ping record to approximate what a correct Service Ping payload would have been. The only time we would consider this is if the Service Ping bug materially impacts XMAU reporting or any other company-wide metric and there are no better alternatives. An example case can be found in this issue. In this case, RedisHLL metrics were dramatically underreporting actual values at the end of March 2023, causing XMAU reporting to be wildly incorrect. For some metrics, the metric values were down 70% from the previous period (we would generally expect to see a small deviation from previous period values).

In this case, the recommended response was to duplicate a 2023-04-03 Automated SaaS Instance Service Ping and change the ping_date to 2023-03-31 so that this ping would be programatically selected for March 2023 XMAU reporting. While this is not an ideal approach, it was deemed to be the best available option because it provided mostly accurate data that would seamlessly flow through the data models, prevented custom SQL or bespoke logic in BI tools, and required a small amount of effort from the Data Team. To quarantine bad (original) ping from 2023-03-31 and avoid wrong data flow in downstream models, DE will update the following columns to isolate incorrect ping:

PING_DATE=2023-03-31
PING_SOURCE=‘wrong’
_UPLOADED_AT = 1680296000 - 2023-03-31 Generally, the date and time same as the PING_DATE column (just exposed in epoch time format).

Other active solutions were more complex and did not achieve substantially higher data quality, and doing nothing would propagate incorrect data in reporting models throughout the EDM.

In the long run, the target state is to quarantine any incorrect data and prevent it from reaching production in the first place. Unfortunately, we do not currently have sophisticated data contracts enforced in the pipeline that would prevent the wrong data from reaching production. Until we have that, we may be forced to pursue non-ideal options such as this one.

Back dating events

Date	Reason	Actions taken	Detailed Denomas Issue
2023-04-05	Fixing March 2023 Redis HLL metric bug	DEs duplicate and backdate later ping; quarantine incorrect ping	link
2023-08-03	Fixing July 2023 `instance_user_count` bug	DEs duplicate and backdate later ping	link

Phase 2: Metrics transformation to Trusted Data Model

Once all of the source metrics are available in Snowflake RAW.SAAS_USAGE_PING schema, we begin dbt processing to transform the data into the Trusted Data Model format.

Known Limitations/Improvements

Namespace-level Redis-source Metrics are not yet available
Snowflake has redundant “legacy” service-ping processes and these need to be deprecated

Service Ping Metrics Types

Within Service Ping, there are 2 main types of metrics supported:

SQL metrics: metrics sourced from Postgres tables
Redis metrics: metrics sourced from Redis based counters

SQL Metrics Implementation

The SQL-based metrics workflow is the most complicated flow. SQL-based metrics are actually created by a SQL query run against the Postgres SQL database of the instance. For large tables, these queries can be very long to run. An example is for example the counts.ci_builds metric which is running a COUNT(*) on the ci_builds which is one of our largest (see dbt table containing more than 1 billion rows). The goal of this module will be to run against our Snowflake database instead of the postgres SQL database of our SaaS Instance.

The Analytics Instrumentation team has created an API endpoint that enables us to retrieve all the SQL queries to run to calculate the metrics. Here is an example file.

A technical documentation about the API endpoint is available on the export service ping sql queries page.

Let’s take a look at a few queries received in the JSON response:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


 "counts": {
    "assignee_lists": "SELECT COUNT("lists"."id") FROM "lists" WHERE "lists"."list_type" = 3",
    "boards": "SELECT COUNT("boards"."id") FROM "boards"",
    "ci_builds": "SELECT COUNT("ci_builds"."id") FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build'",
    "ci_internal_pipelines": "SELECT COUNT("ci_pipelines"."id") FROM "ci_pipelines" WHERE ("ci_pipelines"."source" IN (1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13) OR "ci_pipelines"."source" IS NULL)",
    "ci_external_pipelines": "SELECT COUNT("ci_pipelines"."id") FROM "ci_pipelines" WHERE "ci_pipelines"."source" = 6",
    "ci_pipeline_config_auto_devops": "SELECT COUNT("ci_pipelines"."id") FROM "ci_pipelines" WHERE "ci_pipelines"."config_source" = 2",
    "ci_pipeline_config_repository": "SELECT COUNT("ci_pipelines"."id") FROM "ci_pipelines" WHERE "ci_pipelines"."config_source" = 1",
    "ci_runners": "SELECT COUNT("ci_runners"."id") FROM "ci_runners"",
    "ci_triggers": "SELECT COUNT("ci_triggers"."id") FROM "ci_triggers""
    ...

So the goal would be to be able to run them against tables in Snowflake (synced from Denomas Saas). We need to do so, to have tables that have the same column names and the same granularity as the ones in the Postgres SQL tables.

Here below, you see the way we currently transform the Postgres data in Snowflake:

ALT

As highlighted here, we created a dedupe layer that is exactly meeting this criteria. dedupe layer is handy mechanism in order to avoid duplicates and provide accurate data without any compromise.

We have then identified the tables against which we can run the SQL-based metrics queries. We will then need to transform the SQL statements to query against these tables.

We have a script running that transforms this SQL statements :

1
2
3


SELECT COUNT("ci_builds"."id")
   FROM "ci_builds"
  WHERE "ci_builds"."type" = 'Ci::Build'

to this SQL statement:

1
2
3
4
5


SELECT 'counts.ci_builds'    AS counter_name,
       COUNT(ci_builds.id)   AS counter_value,
       TO_DATE(CURRENT_DATE) AS run_day
  FROM prep.gitlab_dotcom.gitlab_dotcom_ci_builds_dedupe_source AS ci_builds
 WHERE ci_builds.type = 'Ci::Build'

We then run all these queries and store the results in a json that we send them to the table called RAW.SAAS_USAGE_PING.INSTANCE_COMBINED_METRICS. This table has the following columns:

query_map: that stores all the queries run
run_results: that stores the results returns
ping_date: date when the query got run
run_id: unique identifier of the processing

For any error appeared, data is saved into RAW.SAAS_USAGE_PING.INSTANCE_SQL_ERRORS table. This table keeps data in json data type. On the top of the table, mechanism is build to incorporate any error in the Trusted Data Model and easily make any malfunction in the SQL processing visible.

Error handling for SQL based service ping

For error handling solution, refer to runnbooks Automated Service Ping article.

New metrics checks

The Data team created an app and delivered it to the PI team as they are able to autonomously use it when any of the metrics are changed (SELECT statement) or added. This will prevent the issue with failing the pipeline in production, the common error is related to executing SQL on a non-existing object in the Snowflake database. The example issue should be found in the issue 2022-09-28 dbt-test Failure in test instance_sql_error_rowcount. Here is the idea of how it is working:

graph TD
  C_API --> API
  F_R --API Response--> F_RES["{code: 200, body: '...'}"]
  IF--No-->PING_DT

  F_RES--Send response--> IF
  subgraph "Product intelligence team"
      NM[New metric introduced]--Generate query-->PI[Postgres query]
      PI--Put query/ies as a parameter-->C_API[Call RESTful API]
      IF{IF response_code==200}
      RD[Ready to deploy metrics change]
  end

  subgraph "Data team application"

      API{{Accept request}}--Transform query-->Q_T[Postgres->Snowflake SQL transformation]
      Q_T[Execute query]--Snowflake syntax-->SF_E[(Snowflake)]
      SF_E-->G_R{Accept and check result}
      G_R--Success-->G_R_Y["{code: 200, body: '...'}"]
      G_R--Error-->G_R_N["{code: 400, body: '...'}"]
      G_R_Y-->F_R[Send response to client];
      G_R_N-->F_R;
      IF--Yes-->RD
      PING_DT[Ping Data Team to analyze the error and add an object into Snowflake]
  end

More details on app with the source code should be found in the service-ping-metrics-check repo.

Redis Metrics Implementation

Redis counters are used to record high-frequency occurrences of some arbitrary situation happening in Denomas, that do not create a permanent record in our Database, for example when a user folds or unfolds the side bar. In such cases, the backend engineer will define a name that would represent a given situation, for example navigation_sidebar_opened, and also arbitrarily decide on the moment (by adding dedicated piece of code in existing execution path) when it happens.

The Analytics Instrumentation team has created an API endpoint that allows the Data Team to retrieve all Redis metrics value at any time we want. An example of the JSON Response is available on the page UsageDataNonSqlMetrics API. Note that -3 means that the metrics is not redis so the API doesn’t retrieve any value for it. Once the JSON response received, we store it in Snowflake in the table RAW.SAAS_USAGE_PING.INSTANCE_COMBINED_METRICS. Additional technical documentation about the API endpoint is available here. The table has the following columns:

jsontext: that stores all the queries run
ping_date: date when the query got run
run_id: unique identifier of the processing

Performance improvements

As a crucial advantage for service ping automation worth mentioning significant performance improvement for Instance SQL metrics and Instance Namespace metrics. Graphical comparison is exposed in the picture below.

Instance SQL metrics improvements

Instance Namespace metrics improvements

Airflow setup

The Airflow DAG saas-instance-usage-ping run every Monday 0700 UTC that executes all the operations described below:

fetching the queries from the API Endpoint
transforming the queries to be able to run them against Snowflake dedupe layer
run the queries
store the results in Snowflake

From RAW to PROD database and Sisense

We currently do a limited transformation once the data is stored in RAW. In the future, the data flow will look like that:

That means the data set created will be UNIONED with the current data pipeline in the model prep_usage_data_flattened.

Reconciliation process - Manual VS Automated Service ping

We might want to add some content and details to this page regarding the known issues that we came across during the reconciliation process:

Full sync is currently not enabled for most of the Postgres db tables that read data from Denomas.com using PG replica process as Denomas.com production database does not store deleted records and only holds incremental data.

Related Issues:

Explore option to switch table load for full load (PG Replica)
Missing metrics from Automated process when compared to Manual process, this is due to the reasons as listed in this Issue.
The Geo related metrics are missing from Automated process as they are not enabled for Denomas.com
We do not have a 100% match in terms of metric values for metrics in Automated process when compared to Manual process, and these variances could be due to various factors - run times of the process, History data not available in the new process etc..

Variance Frequency %s for 28d, all time and 7d is available below along with a detailed comparison of the metrics from Automated and manual process for different Ping dates:

There are some broken queries that the PI team is working on fixing that directly impact the metrics values in the Automated process [gitlab-org&5158, gitlab-org&7451].

Useful links

Last modified December 6, 2023: update (3d741be9)

View page source - Edit this page - please contribute.