By default, ClickHouse limits the amount of memory used for GROUP BY (it processes GROUP BY with a hash table). This is easy to fix. If you have spare memory, increase this setting:
SET max_memory_usage = 128000000000; -- 128 GB
If you don't have that much memory available, ClickHouse can "spill" data to disk by setting:
SET max_bytes_before_external_group_by = 20000000000; -- 20 GB
SET max_memory_usage = 40000000000; -- 40 GB
According to the documentation, if you need to use max_bytes_before_external_group_by, it is recommended to set max_memory_usage to roughly twice the value of max_bytes_before_external_group_by.
(The reason is that aggregation runs in two stages: 1. executing the query and building intermediate data; 2. merging the intermediate data. Spilling to disk normally happens only during stage 1; if no spill occurred, stage 2 may need as much memory as stage 1.)
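The same limits can also be attached to a single query with a SETTINGS clause instead of changing them session-wide. A minimal sketch against the wikistat table used in the benchmarks below (the byte values are illustrative, copied from the session settings above):

-- spill GROUP BY state to disk past 20 GB; fail the query past 40 GB
SELECT path, sum(hits) AS sum_hits
FROM wikistat
GROUP BY path
ORDER BY sum_hits DESC
LIMIT 10
SETTINGS max_bytes_before_external_group_by = 20000000000, max_memory_usage = 40000000000;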
Performance comparison:
Both ClickHouse and Spark are distributed systems, but for ease of testing we ran single-node tests only. The results were striking:
| Size / compression | Spark v. 2.0.2 | ClickHouse |
|---|---|---|
| Storage format | Parquet, compressed: snappy | Internal storage, compressed |
| Size (uncompressed: 1.2TB) | 395G | 212G |
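On the ClickHouse side, the compressed and uncompressed sizes can be checked against the system.parts system table. A minimal sketch, assuming the table name wikistat from the benchmark below (column names are those of current ClickHouse releases; older versions expose different names):

SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed
FROM system.parts
WHERE active AND table = 'wikistat'
GROUP BY table;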
| Test | Spark v. 2.0.2 | ClickHouse | Diff |
|---|---|---|---|
| Query 1: count (warm) | 7.37 sec (no disk IO) | 6.61 sec | ~same |
| Query 2: simple group (warm) | 792.55 sec (no disk IO) | 37.45 sec | 21x better |
| Query 3: complex group by | 2522.9 sec | 398.55 sec | 6.3x better |
Appendix:
Hardware
Query 1
select count(*) from wikistat
ClickHouse:
:) select count(*) from wikistat;

SELECT count(*)
FROM wikistat

┌─────count()─┐
│ 26935251789 │
└─────────────┘

1 rows in set. Elapsed: 6.610 sec. Processed 26.88 billion rows, 53.77 GB (4.07 billion rows/s., 8.13 GB/s.)
Spark:
spark-sql> select count(*) from wikistat;
26935251789
Time taken: 7.369 seconds, Fetched 1 row(s)
Query 2
select count(*), month(dt) as mon from wikistat where year(dt)=2008 and month(dt) between 1 and 10 group by month(dt) order by month(dt);
ClickHouse:
:) select count(*), toMonth(date) as mon from wikistat where toYear(date)=2008 and toMonth(date) between 1 and 10 group by mon;

SELECT
    count(*),
    toMonth(date) AS mon
FROM wikistat
WHERE (toYear(date) = 2008) AND ((toMonth(date) >= 1) AND (toMonth(date) <= 10))
GROUP BY mon

┌────count()─┬─mon─┐
│ 2100162604 │   1 │
│ 1969757069 │   2 │
│ 2081371530 │   3 │
│ 2156878512 │   4 │
│ 2476890621 │   5 │
│ 2526662896 │   6 │
│ 2489723244 │   7 │
│ 2480356358 │   8 │
│ 2522746544 │   9 │
│ 2614372352 │  10 │
└────────────┴─────┘

10 rows in set. Elapsed: 37.450 sec. Processed 23.37 billion rows, 46.74 GB (623.97 million rows/s., 1.25 GB/s.)
Spark:
spark-sql> select count(*), month(dt) as mon from wikistat where year(dt)=2008 and month(dt) between 1 and 10 group by month(dt) order by month(dt);
2100162604  1
1969757069  2
2081371530  3
2156878512  4
2476890621  5
2526662896  6
2489723244  7
2480356358  8
2522746544  9
2614372352  10
Time taken: 792.552 seconds, Fetched 10 row(s)
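One caveat when comparing the two runs: the ClickHouse query above omits the ORDER BY that the Spark version carries, and ClickHouse does not guarantee any ordering of GROUP BY output (the rows above simply happened to come back in month order). A strictly equivalent query would look like this sketch:

SELECT count(*), toMonth(date) AS mon
FROM wikistat
WHERE toYear(date) = 2008 AND toMonth(date) BETWEEN 1 AND 10
GROUP BY mon
ORDER BY mon;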
Query 3
SELECT path, count(*), sum(hits) AS sum_hits, round(sum(hits) / count(*), 2) AS hit_ratio FROM wikistat WHERE project = 'en' GROUP BY path ORDER BY sum_hits DESC LIMIT 100;
ClickHouse:
:) SELECT
:-]     path,
:-]     count(*),
:-]     sum(hits) AS sum_hits,
:-]     round(sum(hits) / count(*), 2) AS hit_ratio
:-] FROM wikistat
:-] WHERE (project = 'en')
:-] GROUP BY path
:-] ORDER BY sum_hits DESC
:-] LIMIT 100;

SELECT
    path,
    count(*),
    sum(hits) AS sum_hits,
    round(sum(hits) / count(*), 2) AS hit_ratio
FROM wikistat
WHERE project = 'en'
GROUP BY path
ORDER BY sum_hits DESC
LIMIT 100

┌─path───────────────┬─count()─┬───sum_hits─┬─hit_ratio─┐
│ Special:Search     │   44795 │ 4544605711 │ 101453.41 │
│ Main_Page          │   31930 │ 2115896977 │  66266.74 │
│ Special:Random     │   30159 │  533830534 │  17700.54 │
│ Wiki               │   10237 │   40488416 │   3955.11 │
│ Special:Watchlist  │   38206 │   37200069 │    973.67 │
│ YouTube            │    9960 │   34349804 │   3448.78 │
│ Special:Randompage │    8085 │   28959624 │    3581.9 │
│ Special:AutoLogin  │   34413 │   24436845 │    710.11 │
│ Facebook           │    7153 │   18263353 │   2553.24 │
│ Wikipedia          │   23732 │   17848385 │    752.08 │
│ Barack_Obama       │   13832 │   16965775 │   1226.56 │
│ index.html         │    6658 │   16921583 │   2541.54 │
…

100 rows in set. Elapsed: 398.550 sec. Processed 26.88 billion rows, 1.24 TB (67.45 million rows/s., 3.10 GB/s.)
Spark:
spark-sql> SELECT
         >     path,
         >     count(*),
         >     sum(hits) AS sum_hits,
         >     round(sum(hits) / count(*), 2) AS hit_ratio
         > FROM wikistat
         > WHERE (project = 'en')
         > GROUP BY path
         > ORDER BY sum_hits DESC
         > LIMIT 100;
...
Time taken: 2522.903 seconds, Fetched 100 row(s)
References:
https://www.percona.com/blog/2017/02/13/clickhouse-new-opensource-columnar-database/
https://clickhouse.yandex/docs/en/single/#ClickHouse%20features%20that%20can%20be%20considered%20disadvantages