You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
GOAL: find the id, kind with most occurrences, and then count occurrences per 20 second time group. I've added 4 variations on the same query here, a) recursive CTE, b) two separate queries, c) query using an alias for the summary, d) similar to (c) but removes the UNION ALL. Only query (b) makes use of the primary key for the time series query.
The recursive CTE scans 150.02k (where is the 0.02k from?) rows, multi step query scans 50k+8.19k rows. Aliased query scans 150k rows. Could recursive CTE support optimizing the recursion step when primary key fields are included in the where? Or is there a better way I could handle this query pattern?
For the real world use case, we are using these style queries (currently (b)) for performing analytics on opentelemetry traces, and logs. The multi query pattern has some additional benefits where we can infer primary key fields used by grouping them during the summary query, and then filtering them down in the timeseries query.
CREATETABLEevents (
time DateTime,
id String,
kind String
)
ENGINE=MergeTree
ORDER BY (kind, id, toUnixTimestamp(time));
INSERT INTO events
SELECT now() - toIntervalSecond(mod(number, 300)),
concat('item ', randPoisson(10)),
concat('kind ', randPoisson(10))
FROM numbers(50000);
-- expressed as recursive cte
WITH RECURSIVE analytics AS (
SELECT id,
kind,
count() AS c,
CAST(NULL, 'Nullable(DateTime)') AStimestampFROM events
GROUP BY id, kind
ORDER BY c DESCLIMIT1UNION ALLSELECT id,
kind,
count() AS c,
toStartOfInterval(time, toIntervalSecond(20)) AStimestampFROM events AS t, analytics AS s
WHEREs.id=t.idANDs.timestamp IS NULLANDs.kind=t.kindGROUP BYtimestamp, id, kind
)
SELECT*FROM analytics
ORDER BYtimestampDESC, c DESC
┌─id──────┬─kind────┬────c─┬───────────timestamp─┐
1. │ item 10 │ kind 10 │ 21 │ 2024-05-1813:26:40 │
2. │ item 10 │ kind 10 │ 427 │ 2024-05-1813:26:20 │
3. │ item 10 │ kind 10 │ 1931 │ 2024-05-1813:26:00 │
4. │ item 10 │ kind 10 │ 2644 │ 2024-05-1813:25:40 │
5. │ item 10 │ kind 10 │ 1158 │ 2024-05-1813:25:20 │
6. │ item 10 │ kind 10 │ 150 │ 2024-05-1813:25:00 │
7. │ item 10 │ kind 10 │ 9 │ 2024-05-1813:24:40 │
8. │ item 10 │ kind 10 │ 6340 │ ᴺᵁᴸᴸ │
└─────────┴─────────┴──────┴─────────────────────┘
8 rows inset. Elapsed: 0.026 sec. Processed 150.02 thousand rows, 5.06 MB (5.87 million rows/s., 198.05 MB/s.)
Peak memory usage: 503.98 KiB.
-- 150k rows processed, or 3 full table scans-- expressed as two queries, one for summary, one for results-- step 1, summary querySELECT
id,
kind,
count() AS c,
CAST(NULL, 'Nullable(DateTime)') AStimestampFROM events
GROUP BY
id,
kind
ORDER BY c DESCLIMIT1
Query id: 41e30428-d229-4496-8be8-63cce5c84561
┌─id──────┬─kind────┬────c─┬─timestamp─┐
1. │ item 10 │ kind 10 │ 6340 │ ᴺᵁᴸᴸ │
└─────────┴─────────┴──────┴───────────┘
1 row inset. Elapsed: 0.009 sec. Processed 50.00 thousand rows, 1.55 MB (5.81 million rows/s., 180.59 MB/s.)
Peak memory usage: 1.13 KiB.
-- step 2, timeseries querySELECT
id,
kind,
count() AS c,
toStartOfInterval(time, toIntervalSecond(20)) AStimestampFROM events AS t
WHERE (id ='item 10') AND (kind ='kind 10')
GROUP BYtimestamp,
id,
kind
Query id: 9d184301-8da9-49e8-aa1c-79daf1763abe
┌─id──────┬─kind────┬────c─┬───────────timestamp─┐
1. │ item 10 │ kind 10 │ 9 │ 2024-05-1813:24:40 │
2. │ item 10 │ kind 10 │ 1158 │ 2024-05-1813:25:20 │
3. │ item 10 │ kind 10 │ 2644 │ 2024-05-1813:25:40 │
4. │ item 10 │ kind 10 │ 150 │ 2024-05-1813:25:00 │
5. │ item 10 │ kind 10 │ 1931 │ 2024-05-1813:26:00 │
6. │ item 10 │ kind 10 │ 427 │ 2024-05-1813:26:20 │
7. │ item 10 │ kind 10 │ 21 │ 2024-05-1813:26:40 │
└─────────┴─────────┴──────┴─────────────────────┘
7 rows inset. Elapsed: 0.005 sec. Processed 8.19 thousand rows, 294.87 KB (1.72 million rows/s., 61.78 MB/s.)
Peak memory usage: 1.34 KiB.
-- expressed as single query using summary alias. with UNION ALL = 150k rows processed, without = 100k rows processed
WITH summary AS
(
SELECT
id,
kind,
count() AS c,
CAST(NULL, 'Nullable(DateTime)') AS ts
FROM events
GROUP BY
id,
kind
ORDER BY c DESCLIMIT1
)
SELECT
id,
kind,
count() AS c,
toStartOfInterval(time, toIntervalSecond(20)) AStimestampFROM events AS t, summary AS s
WHERE (t.id=s.id) AND (s.ts IS NULL) AND (s.kind=t.kind)
GROUP BYtimestamp,
id,
kind
UNION ALLSELECT*FROM summary
Query id: 7a630262-e310-4695-9af5-dab99f1c9834
┌─id──────┬─kind────┬────c─┬─timestamp─┐
1. │ item 10 │ kind 10 │ 6340 │ ᴺᵁᴸᴸ │
└─────────┴─────────┴──────┴───────────┘
┌─id──────┬─kind────┬────c─┬───────────timestamp─┐
2. │ item 10 │ kind 10 │ 9 │ 2024-05-1813:24:40 │
3. │ item 10 │ kind 10 │ 1158 │ 2024-05-1813:25:20 │
4. │ item 10 │ kind 10 │ 2644 │ 2024-05-1813:25:40 │
5. │ item 10 │ kind 10 │ 150 │ 2024-05-1813:25:00 │
6. │ item 10 │ kind 10 │ 1931 │ 2024-05-1813:26:00 │
7. │ item 10 │ kind 10 │ 427 │ 2024-05-1813:26:20 │
8. │ item 10 │ kind 10 │ 21 │ 2024-05-1813:26:40 │
└─────────┴─────────┴──────┴─────────────────────┘
8 rows inset. Elapsed: 0.019 sec. Processed 150.00 thousand rows, 4.86 MB (7.86 million rows/s., 254.81 MB/s.)
Peak memory usage: 3.12 KiB.
-- 4th variation, use anyIf to get result of summary query without UNION ALL
WITH summary AS
(
SELECT
id,
kind,
count() AS c,
CAST(NULL, 'Nullable(DateTime)') AS ts
FROM events
GROUP BY
id,
kind
ORDER BY c DESCLIMIT1
)
SELECT
id,
kind,
count() AS c,
toStartOfInterval(time, toIntervalSecond(20)) AStimestamp,
anyIf(s.c, (s.id=t.id) AND (s.kind=t.kind))
FROM events AS t, summary AS s
WHERE (t.id=s.id) AND (s.ts IS NULL) AND (s.kind=t.kind)
GROUP BYtimestamp,
id,
kind
Query id: ccbb0d55-b901-48ff-9e69-21bdd6b3ef14
┌─id──────┬─kind────┬────c─┬───────────timestamp─┬─anyIf(s.c, and(equals(s.id, id), equals(s.kind, kind)))─┐
1. │ item 10 │ kind 10 │ 222 │ 2024-05-1814:35:40 │ 6240 │
2. │ item 10 │ kind 10 │ 12 │ 2024-05-1814:37:20 │ 6240 │
3. │ item 10 │ kind 10 │ 1417 │ 2024-05-1814:36:00 │ 6240 │
4. │ item 10 │ kind 10 │ 275 │ 2024-05-1814:37:00 │ 6240 │
5. │ item 10 │ kind 10 │ 2787 │ 2024-05-1814:36:20 │ 6240 │
6. │ item 10 │ kind 10 │ 1517 │ 2024-05-1814:36:40 │ 6240 │
7. │ item 10 │ kind 10 │ 10 │ 2024-05-1814:35:20 │ 6240 │
└─────────┴─────────┴──────┴─────────────────────┴─────────────────────────────────────────────────────────┘
7 rows inset. Elapsed: 0.018 sec. Processed 100.00 thousand rows, 3.31 MB (5.48 million rows/s., 181.16 MB/s.)
Peak memory usage: 4.30 MiB.
The text was updated successfully, but these errors were encountered:
GOAL: find the id, kind with most occurrences, and then count occurrences per 20 second time group. I've added 4 variations on the same query here, a) recursive CTE, b) two separate queries, c) query using an alias for the summary, d) similar to (c) but removes the
UNION ALL
. Only query (b) makes use of the primary key for the time series query.The recursive CTE scans 150.02k (where is the 0.02k from?) rows, multi step query scans 50k+8.19k rows. Aliased query scans 150k rows. Could recursive CTE support optimizing the recursion step when primary key fields are included in the where? Or is there a better way I could handle this query pattern?
For the real world use case, we are using these style queries (currently (b)) for performing analytics on opentelemetry traces, and logs. The multi query pattern has some additional benefits where we can infer primary key fields used by grouping them during the summary query, and then filtering them down in the timeseries query.
The text was updated successfully, but these errors were encountered: