# Self-Managed Metric Estimation Algorithm
## Introduction to Service Ping estimation
Self-Managed data arrives via Service Ping, which returns aggregated counts of user activity and other usage metrics for analysis. The Service Ping data we receive is incomplete, as there are 3 types of installations we need to estimate for:
- Installations that opt out of sending us Service Ping
- Installations that opt to send only a subset of Service Ping metrics (i.e., operational metrics)
  - This applies to any metric where `category = 'optional'`
- Installations on older versions without the metric instrumented
  - Ex: If a metric is instrumented in 13.3, we need to estimate usage for installations reporting on versions < 13.3
This leads to only a subset of the data being reported. Some data will be partial: we know an installation exists, but we did not receive all the metrics we report on (whether due to version or metric opt-out). Other installations will never be seen, and we cannot know how many of these exist because they opted out of reporting Service Ping completely. The only moment we can identify such an installation is when it creates a subscription (not all installations do); at that point we know a self-managed installation exists even if it does not send a ping. A subscription can choose whether or not to report Service Ping.
To report the total MAU, we therefore need to apply estimates on top of the metrics and their aggregate to address the points above.
The ultimate goal is to create estimation methodologies that meet the following criteria:
- They are logically sound (i.e., we think they are reasonable for estimating usage)
- They produce reasonable results (ex: they do not produce extremely volatile metrics, which seem unlikely to reflect reality)
- They are easy to explain and understand
- They are auditable
If those criteria are not met, it is challenging to have trust in the integrity of the estimation and accept it as a viable solution for KPI reporting and analysis. The goal of this page is to enable greater understanding of why we generate estimations, who we generate them for, and what our current methodologies are. For the sake of transparency and to enable self-service exploration, the mechanisms and details of how and where we generate components of our methodologies are included throughout the page.
## The current methodology (‘metric/version check - subscription based estimation’)
Total Ping with Estimate = (Recorded Ping)/(Percent Reporting)
To estimate the missing proportion of a Ping metric's value, we currently calculate the percentage of reporting installations, which is measured as:
Percent Reporting = (Number of Ping Reporting with Correct Version)/(Total number of Subscriptions)
Below is a table with fake data representing a theoretical collection of Ping data to illustrate how it is estimated. Note that the “Actual Usage” column is not data we really have; it is meant to illustrate a theoretical target and to show which types of installations we would have Ping data for. Two rows deserve immediate attention. The first row shows SaaS data: it is not included in the estimation because we have complete reporting on it, so no estimate is required. The final row shows installations that have opted out of Ping, are self-managed, and do not have a subscription. It is intended to show the unknown number of potential users and activity in this group, as we have no way of knowing their activity.
| Installation Type | Number of Installations | Subscription? | Ping Opt-In | Version Instrumented | Metric Opted In | Actual Usage | Reported Value |
|---|---|---|---|---|---|---|---|
| SaaS | 1 | Yes | Yes | Yes | Yes | 30 | 30 |
| Right Version Opted in | 60 | Yes | Yes | Yes | Yes | 80 | 80 |
| Wrong Version Installations | 30 | Yes | Yes | No | Yes | 30 | NULL |
| Metric Opted Out | 15 | Yes | Yes | Yes | No | 20 | NULL |
| No Ping Opt-in | 9 | Yes | No | Yes | No | 10 | NULL |
| No subscription, Ping Reporting | 20 | No | Yes | Yes | Yes | 30 | 20 |
| No Ping, No data | ? | No | No | Yes | No | ? | NULL |
If we add up the installations reporting this Ping metric on the correct version (60 with a subscription plus 20 without), we see 80 are reporting. Adding the 54 subscribed installations that are not reporting (30 + 15 + 9) to those 80 gives a total of 134. This gives us a percentage-reporting estimate of 80/134 = 59.7%.
To estimate the total metric value, we take the reported value of 100 (80 + 20) and divide it by 59.7%, for an estimated total of 168.
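To make the arithmetic concrete, here is a runnable SQL sketch of the same calculation using the fake values above. `DIV0` is Snowflake's division function that returns 0 on a zero denominator, and it is also used by the macros described later on this page.

```sql
-- Fake values from the table above:
--   80  = installations reporting the metric (60 subscribed + 20 unsubscribed)
--   54  = subscribed installations not reporting it (30 + 15 + 9)
--   100 = total reported metric value (80 + 20)
WITH inputs AS (
    SELECT 80 AS reporting_count, 54 AS no_reporting_count, 100 AS recorded_value
)

SELECT
    DIV0(reporting_count, reporting_count + no_reporting_count) AS percent_reporting,   -- ~0.597
    DIV0(recorded_value, percent_reporting)                     AS total_with_estimate  -- ~168
FROM inputs;
```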
*Visual representation of Ping estimation example (figure omitted).*
## Available methodologies
Several different methodologies are available in the reporting models (ex: `rpt_ping_metric_totals_w_estimates_monthly`). We use the field `estimation_grain` to differentiate the methodologies. Right now, each `estimation_grain` has 2 components: a condition and a measure. The “condition” relates to the actual reporting from Service Ping; the “measure” is the value we use to generate the estimate.
| Condition | Measure | estimation_grain |
|---|---|---|
| On version with metric | Subscription count | metric/version check - subscription based estimation* |
| On version with metric | Seat count | metric/version check - seat based estimation |
| Reported metric | Subscription count | reported metric - subscription based estimation |
| Reported metric | Seat count | reported metric - seat based estimation |
*This is the original methodology used in the legacy models.
## Knowns and unknowns

### What we know
- The number of active subscriptions per month
- The number of active subscriptions that send Denomas Service Ping data each month
- The release version for each specific metric/counter and edition (CE vs EE) in Service Ping
  - Ex: we know that `redis_hll_counters.epics_usage.epics_usage_total_unique_counts_monthly` was released in version 13.7 on CE and 13.8 on EE
#### Technical implementation details

- Identifying active subscriptions
  - Whether a subscription is “active” within a given month is determined by 1 of 2 things:
    - The subscription’s presence in `fct_charge` with a status of either `Active` or `Cancelled`
    - The subscription’s presence in `mart_ping_instance_metric_monthly` (in other words, subscriptions that sent a ping), even if they were not found in `fct_charge`
  - Details on active subscriptions at the subscription- and installation-level are surfaced in `rpt_ping_latest_subscriptions_monthly`
- Counting active subscriptions that sent Service Ping
  - This is captured by a distinct count of subscriptions from `mart_ping_instance_metric_monthly`
- Determining when a metric was instrumented
  - This is determined by finding the MIN and MAX version where that metric exists on a given edition within `mart_ping_instance_metric_monthly` (see the sketch below)
  - This data is surfaced in `rpt_ping_metric_first_last_versions`
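As a rough illustration of that version-range step, the following is a minimal sketch and not the actual model SQL; the column names (`metrics_path`, `ping_edition`, `major_minor_version_num`) are assumptions about what the mart exposes.

```sql
-- Hypothetical sketch: derive the first and last version where each metric
-- appears, per edition. Column names are illustrative assumptions; a real
-- implementation also needs a numerically sortable version representation
-- (e.g., 13.10 must sort after 13.7).
SELECT
    metrics_path,
    ping_edition,
    MIN(major_minor_version_num) AS first_version_with_metric,
    MAX(major_minor_version_num) AS last_version_with_metric
FROM mart_ping_instance_metric_monthly
GROUP BY metrics_path, ping_edition;
```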
### What we do not know

- The number of active free installations
  - If a free installation does not send us Service Ping, we have no way of knowing it exists
Since we do not have visibility into the entire scope of the free Self-Managed universe, we need to rely on the behavior of paid installations to generate estimates.
## Assumptions
Given the knowns and unknowns, our current methodologies all make the following assumptions:
- Installations that do not report metrics behave the same as installations that send Service Ping
  - This means we assume that free and paid installations act the same, that paid installations that opt into sending Service Ping act the same as those that do not, etc.
- Installations that do not report metrics adopt versions at the same rate as installations that send Service Ping
  - All editions and tiers are assumed to have the same version adoption rate
- Free installations opt into sending Service Ping at the same rate as paid installations
In the future, we hope to iterate on these methodologies in order to address some of the issues with these assumptions.
## Methodology conditions
We currently have 2 distinct conditions for our different methodologies:
- Reporting on a version with the metric
- Reporting the metric
### On version with metric (metric/version check)

- `estimation_grain = 'metric/version check - subscription based estimation'`
  - First version of estimation used in the legacy models
- `estimation_grain = 'metric/version check - seat based estimation'`
Total xMAU (with estimate) = Recorded xMAU / % subscription count or seats on a version with the metric
This condition leverages the ratio of subscriptions or seats (the measure) that report on a version with the metric instrumented. Since metrics are released on different Denomas versions across editions, this ratio is customized for each metric and edition. For metrics introduced on the same version, the ratio will be the same. (Ex: the ratio is the same for all metrics instrumented on EE version 13.3).
With this condition comes an additional assumption: all installations on a given version report all metrics. This does not take into account that some installations only send operational metrics and opt out of sending optional metrics. (Note: not all xMAU metrics are considered to be operational.)
#### Technical implementation details

Whether a subscription meets this condition is determined in `rpt_ping_subscriptions_on_versions_counts_monthly`.
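As an illustrative sketch only (not the actual model SQL), the version check could be implemented along these lines; the join keys and the `major_version`/`minor_version` columns are assumptions:

```sql
-- Hypothetical sketch: a subscription meets the condition when the version it
-- reported on is at or above the version where the metric was first
-- instrumented for its edition. All column names here are assumptions.
SELECT
    ping.latest_subscription_id,
    ping.metrics_path,
    (ping.major_version > versions.first_major_version
     OR (ping.major_version = versions.first_major_version
         AND ping.minor_version >= versions.first_minor_version)
    ) AS is_on_version_with_metric
FROM mart_ping_instance_metric_monthly AS ping
INNER JOIN rpt_ping_metric_first_last_versions AS versions
    ON ping.metrics_path = versions.metrics_path
    AND ping.ping_edition = versions.ping_edition;
```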
### Reported metric

- `estimation_grain = 'reported metric - subscription based estimation'`
- `estimation_grain = 'reported metric - seat based estimation'`
Total xMAU (with estimate) = Recorded xMAU / % subscription count or seats reporting metric
This condition leverages the ratio of subscriptions or seats (the measure) that report a metric. Since not all installations report all metrics (see distinction above re: operational vs optional metrics), this ratio is customized for each metric.
This condition also adds an assumption: all metrics were instrumented on the same version of CE and EE. However, we know that metrics are instrumented on different versions. (Ex: `redis_hll_counters.epics_usage.epics_usage_total_unique_counts_monthly` was released in version 13.7 on CE and 13.8 on EE.)
#### Technical implementation details

Whether a subscription meets this condition is determined in `rpt_ping_subscriptions_reported_estimate_factors_monthly`.
## Methodology measures
We currently have 2 distinct measures for our different methodologies:
- Subscription count
- Seat count
### Subscription count

- `estimation_grain = 'metric/version check - subscription based estimation'`
  - First version of estimation used in the legacy models
- `estimation_grain = 'reported metric - subscription based estimation'`
This measure leverages the actual count of active subscriptions that meet the given condition. This count is inclusive of both paid and unpaid subscriptions (ex: OSS/EDU programs).
This measure carries another assumption: all installations have the same number of users. In other words, an installation with 5 seats is treated/weighted the same as one with 1000 seats.
#### Technical implementation details

Subscription counts are determined by a subscription’s presence in `fct_charge` with a status of either `Active` or `Cancelled`, but we also include subscriptions appearing in `mart_ping_instance_metric_monthly` (in other words, subscriptions that sent a ping) that were not found in `fct_charge`. This is surfaced in the model `rpt_ping_latest_subscriptions_monthly` with the field `latest_subsciption_id`. In subsequent models like `rpt_ping_subscriptions_reported_estimate_factors_monthly`, a distinct count of `latest_subsciption_id` is used to determine the number of subscriptions meeting the condition and the number of potential subscriptions. When determining the ratio or percentage of subscriptions meeting the condition, we use the `pct_w_counters` macro, which contains the following calculation:
```sql
DIV0({{ reporting_count }}, ({{ reporting_count }} + {{ no_reporting_count }}))
```
This is synonymous with `count of subscriptions meeting condition / total subscription count`.
The output of that macro is the % of subscriptions meeting the condition. We then apply that ratio to the recorded usage in the `usage_estimation` macro to generate the estimated usage. The macro applies the following calculation:
```sql
{{ recorded_usage }} + DIV0(({{ recorded_usage }} * (1 - {{ percent_reporting }})), {{ percent_reporting }})
```
Algebraically, this simplifies to `recorded usage / % of subscriptions meeting condition`, i.e., the recorded usage plus an estimated uplift of `(recorded usage / % of subscriptions meeting condition) - recorded usage`. The output of that macro is the estimated total usage. The recorded, estimated, and total usage values are surfaced in `rpt_ping_metric_totals_w_estimates_monthly`, along with the raw inputs (count of total subscriptions, % of subscriptions meeting condition, etc.).
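For reference, here is a minimal dbt Jinja sketch of what these two macros could look like; the bodies match the calculations quoted above, while the exact signatures and any project-specific arguments are assumptions:

```sql
-- Minimal dbt Jinja sketches. The SQL bodies match the calculations quoted
-- above; the macro signatures are assumptions, not the project's actual code.

{% macro pct_w_counters(reporting_count, no_reporting_count) %}
    DIV0({{ reporting_count }}, ({{ reporting_count }} + {{ no_reporting_count }}))
{% endmacro %}

{% macro usage_estimation(recorded_usage, percent_reporting) %}
    {{ recorded_usage }} + DIV0(({{ recorded_usage }} * (1 - {{ percent_reporting }})), {{ percent_reporting }})
{% endmacro %}
```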
### Seat count

- `estimation_grain = 'metric/version check - seat based estimation'`
- `estimation_grain = 'reported metric - seat based estimation'`
This measure leverages the count of licensed seats associated with active subscriptions. This count is inclusive of seats associated with both paid and unpaid subscriptions (ex: OSS/EDU programs).
This measure also brings the following assumption: license utilization is the same across all installations. In other words, we assume that the same share of seats is used, regardless of the installation’s size.
#### Technical implementation details

Seat counts are determined by the quantity found within the `fct_charge` model. This is surfaced as `total_licensed_users` in `rpt_ping_subscriptions_reported_counts_monthly` and `rpt_ping_subscriptions_on_versions_counts_monthly`. When determining the ratio or percentage of seats meeting the condition, we use the `pct_w_counters` macro, which contains the following calculation:
```sql
DIV0({{ reporting_count }}, ({{ reporting_count }} + {{ no_reporting_count }}))
```
This is synonymous with `count of seats meeting condition / total seat count`.
The output of that macro is the % of seats meeting the condition. We then apply that ratio to the recorded usage in the `usage_estimation` macro to generate the estimated usage. The macro applies the following calculation:
```sql
{{ recorded_usage }} + DIV0(({{ recorded_usage }} * (1 - {{ percent_reporting }})), {{ percent_reporting }})
```
Algebraically, this simplifies to `recorded usage / % of seats meeting condition`, i.e., the recorded usage plus an estimated uplift of `(recorded usage / % of seats meeting condition) - recorded usage`. The output of that macro is the estimated total usage. The recorded, estimated, and total usage values are surfaced in `rpt_ping_metric_totals_w_estimates_monthly`, along with the raw inputs (count of total seats, % of seats meeting condition, etc.).
## Examples
In order to better understand how these methodologies translate into actual calculations, here are some examples covering each of the conditions and measures above. These examples use hypothetical data to facilitate the step-by-step calculations.
In both examples we are generating Total Create SMAU, including the estimate. The counter used as SMAU for the Create stage is `usage_activity_by_stage_monthly.create.action_monthly_active_users_project_repo`.
### On version with metric x subscription count

To calculate the Estimated SMAU for `estimation_grain = 'metric/version check - subscription based estimation'`, we follow this process for a given month:

- Determine the number of active subscriptions during that month
  - Active self-managed subscriptions: 5000
  - This occurs in `rpt_ping_subscriptions_reported_counts_monthly`
- Determine when (on which version) the metric was first introduced
  - `usage_activity_by_stage_monthly.create.action_monthly_active_users_project_repo` was released on version 13.3 on both CE and EE
  - This occurs in `rpt_ping_metric_first_last_versions`
- Calculate the % of subscriptions that sent us a Service Ping on a version with the metric instrumented (in this case, subscriptions on versions >= 13.3)
  - % of subscriptions reporting on a version >= 13.3 = 3000/5000 = 60%
  - This occurs in `rpt_ping_subscriptions_on_versions_estimate_factors_monthly`
- Calculate recorded SMAU split by delivery (Self-Managed vs SaaS)
  - Recorded Self-Managed SMAU: 1M
  - SaaS SMAU: 800K
  - This occurs in `rpt_ping_metric_totals_w_estimates_monthly`
- Calculate Total SMAU (with estimate)
  - Total Self-Managed SMAU (with estimate) = Recorded SMAU / % of subscriptions on 13.3+ = 1M / 60% = 1.7M
  - Estimated uplift = (Recorded SMAU / % of subscriptions on 13.3+) - Recorded SMAU = 1.7M - 1M = 0.7M
  - Total SMAU (with estimate) = 1.0M (Recorded Self-Managed) + 0.7M (Estimated Self-Managed Uplift) + 800K (SaaS) = 2.5M
  - This occurs in `rpt_ping_metric_totals_w_estimates_monthly`
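The same steps, condensed into a runnable SQL sketch with the hypothetical numbers above (the CTE and column names are purely illustrative):

```sql
-- Hypothetical inputs from the walkthrough above.
WITH inputs AS (
    SELECT
        3000    AS subscriptions_on_version,  -- subscriptions on versions >= 13.3
        5000    AS total_subscriptions,
        1000000 AS recorded_sm_smau,          -- recorded Self-Managed SMAU
        800000  AS saas_smau
)

SELECT
    DIV0(subscriptions_on_version, total_subscriptions) AS pct_on_version,    -- 0.60
    DIV0(recorded_sm_smau, pct_on_version)              AS total_sm_smau,     -- ~1.7M
    total_sm_smau - recorded_sm_smau                    AS estimated_uplift,  -- ~0.7M
    total_sm_smau + saas_smau                           AS total_smau         -- ~2.5M
FROM inputs;
```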
### Reported metric x seat count

To calculate the Estimated SMAU for `estimation_grain = 'reported metric - seat based estimation'`, we follow this process for a given month:

- Determine the number of licensed seats tied to active subscriptions during that month
  - Active self-managed seats: 1000000
  - This occurs in `rpt_ping_latest_subscriptions_monthly` (subscription-level) and `rpt_ping_subscriptions_reported_counts_monthly` (aggregated)
- Calculate the % of seats that reported the metric
  - % of seats reporting `usage_activity_by_stage_monthly.create.action_monthly_active_users_project_repo` = 530000/1000000 = 53%
  - This occurs in `rpt_ping_subscriptions_reported_estimate_factors_monthly`
- Calculate recorded SMAU split by delivery (Self-Managed vs SaaS)
  - Recorded Self-Managed SMAU: 1M
  - SaaS SMAU: 800K
  - This occurs in `rpt_ping_metric_totals_w_estimates_monthly`
- Calculate Total SMAU (with estimate)
  - Total Self-Managed SMAU (with estimate) = Recorded SMAU / % of seats reporting the metric = 1M / 53% = 1.9M
  - Estimated uplift = (Recorded SMAU / % of seats reporting the metric) - Recorded SMAU = 1.9M - 1M = 0.9M
  - Total SMAU (with estimate) = 1.0M (Recorded Self-Managed) + 0.9M (Estimated Self-Managed Uplift) + 800K (SaaS) = 2.7M
  - This occurs in `rpt_ping_metric_totals_w_estimates_monthly`
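And the equivalent SQL sketch for the seat-based example (again with illustrative names and the hypothetical inputs above):

```sql
-- Hypothetical inputs from the walkthrough above.
WITH inputs AS (
    SELECT
        530000  AS seats_reporting_metric,
        1000000 AS total_seats,
        1000000 AS recorded_sm_smau,  -- recorded Self-Managed SMAU
        800000  AS saas_smau
)

SELECT
    DIV0(seats_reporting_metric, total_seats)   AS pct_seats_reporting,  -- 0.53
    DIV0(recorded_sm_smau, pct_seats_reporting) AS total_sm_smau,        -- ~1.9M
    total_sm_smau - recorded_sm_smau            AS estimated_uplift,     -- ~0.9M
    total_sm_smau + saas_smau                   AS total_smau            -- ~2.7M
FROM inputs;
```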
## Areas of opportunity for iteration
Based on the assumptions called out above, we know that there are opportunities to iterate on the current methodologies. Here are some areas we can explore.
- We could break down the ratio applied per plan/tier (including free vs paid)
  - One current challenge: a subscription can have 2 installations from 2 different product tiers
- We could find a way to layer version adoption rate into the estimation
  - We know that paid installations tend to adopt new versions faster than free installations, so this would be a substantial improvement in estimating free/Core usage
- We could add measures based on paid subscriptions and seats associated with paid subscriptions
  - There is a placeholder issue to explore this methodology here
## Feedback
Please add feedback to this issue
