By default, ClickHouse limits the amount of memory used for GROUP BY (it processes GROUP BY with a hash table). This is easy to fix. If you have spare memory, increase this setting:
SET max_memory_usage = 128000000000; -- 128 GB
If you don't have that much memory available, ClickHouse can "spill" data to disk by setting:
SET max_bytes_before_external_group_by = 20000000000; -- 20 GB
SET max_memory_usage = 40000000000; -- 40 GB
According to the documentation, if you need to use max_bytes_before_external_group_by, it is recommended to set max_memory_usage to roughly twice the value of max_bytes_before_external_group_by.
(The reason is that aggregation runs in two stages: 1. executing the query and building intermediate data; 2. merging the intermediate data. Spilling to disk normally happens only during stage 1; if no spill occurred, stage 2 may need as much memory as stage 1.)
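The same limits can also be attached to a single query with a SETTINGS clause instead of changing them session-wide. A minimal sketch against the wikistat table used in the benchmarks below (the byte values are illustrative, copied from the session settings above):

-- spill GROUP BY state to disk past 20 GB; fail the query past 40 GB
SELECT path, sum(hits) AS sum_hits
FROM wikistat
GROUP BY path
ORDER BY sum_hits DESC
LIMIT 10
SETTINGS max_bytes_before_external_group_by = 20000000000, max_memory_usage = 40000000000;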
Performance comparison:
Both ClickHouse and Spark are distributed systems, but for ease of testing we ran single-node tests only. The results were striking:
| Size / compression | Spark v. 2.0.2 | ClickHouse |
|---|---|---|
| Storage format | Parquet, compressed: snappy | Internal storage, compressed |
| Size (uncompressed: 1.2TB) | 395G | 212G |
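On the ClickHouse side, the compressed and uncompressed sizes can be checked against the system.parts system table. A minimal sketch, assuming the table name wikistat from the benchmark below (column names are those of current ClickHouse releases; older versions expose different names):

SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed
FROM system.parts
WHERE active AND table = 'wikistat'
GROUP BY table;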
| Test | Spark v. 2.0.2 | ClickHouse | Diff |
|---|---|---|---|
| Query 1: count (warm) | 7.37 sec (no disk IO) | 6.61 sec | ~same |
| Query 2: simple group (warm) | 792.55 sec (no disk IO) | 37.45 sec | 21x better |
| Query 3: complex group by | 2522.9 sec | 398.55 sec | 6.3x better |
Appendix:
Hardware
Query 1
select count(*) from wikistat
ClickHouse:
:) select count(*) from wikistat;

SELECT count(*)
FROM wikistat

┌─────count()─┐
│ 26935251789 │
└─────────────┘

1 rows in set. Elapsed: 6.610 sec. Processed 26.88 billion rows, 53.77 GB (4.07 billion rows/s., 8.13 GB/s.)
Spark:
spark-sql> select count(*) from wikistat;
26935251789
Time taken: 7.369 seconds, Fetched 1 row(s)
Query 2
select count(*), month(dt) as mon from wikistat where year(dt)=2008 and month(dt) between 1 and 10 group by month(dt) order by month(dt);
ClickHouse:
:) select count(*), toMonth(date) as mon from wikistat where toYear(date)=2008 and toMonth(date) between 1 and 10 group by mon;

SELECT
    count(*),
    toMonth(date) AS mon
FROM wikistat
WHERE (toYear(date) = 2008) AND ((toMonth(date) >= 1) AND (toMonth(date) <= 10))
GROUP BY mon

┌────count()─┬─mon─┐
│ 2100162604 │   1 │
│ 1969757069 │   2 │
│ 2081371530 │   3 │
│ 2156878512 │   4 │
│ 2476890621 │   5 │
│ 2526662896 │   6 │
│ 2489723244 │   7 │
│ 2480356358 │   8 │
│ 2522746544 │   9 │
│ 2614372352 │  10 │
└────────────┴─────┘

10 rows in set. Elapsed: 37.450 sec. Processed 23.37 billion rows, 46.74 GB (623.97 million rows/s., 1.25 GB/s.)
Spark:
spark-sql> select count(*), month(dt) as mon from wikistat where year(dt)=2008 and month(dt) between 1 and 10 group by month(dt) order by month(dt);
2100162604  1
1969757069  2
2081371530  3
2156878512  4
2476890621  5
2526662896  6
2489723244  7
2480356358  8
2522746544  9
2614372352  10
Time taken: 792.552 seconds, Fetched 10 row(s)
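One caveat when comparing the two runs: the ClickHouse query above omits the ORDER BY that the Spark version carries, and ClickHouse does not guarantee any ordering of GROUP BY output (the rows above simply happened to come back in month order). A strictly equivalent query would look like this sketch:

SELECT count(*), toMonth(date) AS mon
FROM wikistat
WHERE toYear(date) = 2008 AND toMonth(date) BETWEEN 1 AND 10
GROUP BY mon
ORDER BY mon;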
Query 3
SELECT path, count(*), sum(hits) AS sum_hits, round(sum(hits) / count(*), 2) AS hit_ratio FROM wikistat WHERE project = 'en' GROUP BY path ORDER BY sum_hits DESC LIMIT 100;
ClickHouse:
:) SELECT
:-]     path,
:-]     count(*),
:-]     sum(hits) AS sum_hits,
:-]     round(sum(hits) / count(*), 2) AS hit_ratio
:-] FROM wikistat
:-] WHERE (project = 'en')
:-] GROUP BY path
:-] ORDER BY sum_hits DESC
:-] LIMIT 100;

SELECT
    path,
    count(*),
    sum(hits) AS sum_hits,
    round(sum(hits) / count(*), 2) AS hit_ratio
FROM wikistat
WHERE project = 'en'
GROUP BY path
ORDER BY sum_hits DESC
LIMIT 100

┌─path───────────────┬─count()─┬───sum_hits─┬─hit_ratio─┐
│ Special:Search     │   44795 │ 4544605711 │ 101453.41 │
│ Main_Page          │   31930 │ 2115896977 │  66266.74 │
│ Special:Random     │   30159 │  533830534 │  17700.54 │
│ Wiki               │   10237 │   40488416 │   3955.11 │
│ Special:Watchlist  │   38206 │   37200069 │    973.67 │
│ YouTube            │    9960 │   34349804 │   3448.78 │
│ Special:Randompage │    8085 │   28959624 │    3581.9 │
│ Special:AutoLogin  │   34413 │   24436845 │    710.11 │
│ Facebook           │    7153 │   18263353 │   2553.24 │
│ Wikipedia          │   23732 │   17848385 │    752.08 │
│ Barack_Obama       │   13832 │   16965775 │   1226.56 │
│ index.html         │    6658 │   16921583 │   2541.54 │
…

100 rows in set. Elapsed: 398.550 sec. Processed 26.88 billion rows, 1.24 TB (67.45 million rows/s., 3.10 GB/s.)
Spark:
spark-sql> SELECT
         >     path,
         >     count(*),
         >     sum(hits) AS sum_hits,
         >     round(sum(hits) / count(*), 2) AS hit_ratio
         > FROM wikistat
         > WHERE (project = 'en')
         > GROUP BY path
         > ORDER BY sum_hits DESC
         > LIMIT 100;
...
Time taken: 2522.903 seconds, Fetched 100 row(s)
References:
https://www.percona.com/blog/2017/02/13/clickhouse-new-opensource-columnar-database/
https://clickhouse.yandex/docs/en/single/#ClickHouse%20features%20that%20can%20be%20considered%20disadvantages