
ClickHouse Open-Source Database

Posted: 2021-07-01 10:21:17

SELECT
    min(toMonth(date)),
    max(toMonth(date)),
    path,
    count(*),
    sum(hits),
    sum(hits) / count(*) AS hit_ratio
FROM wikistat
WHERE project = 'en'
GROUP BY path
ORDER BY hit_ratio DESC
LIMIT 10

↖ Progress: 1.83 billion rows, 85.31 GB (68.80 million rows/s., 3.21 GB/s.) ██████████▋  6%

Received exception from server:
Code: 241. DB::Exception: Received from localhost:9000, 127.0.0.1. DB::Exception: Memory limit (for query) exceeded: would use 9.31 GiB (attempt to allocate chunk of 1048576 bytes), maximum: 9.31 GiB: (while reading column hits):

By default, ClickHouse limits the amount of memory a GROUP BY may use (it processes GROUP BY with an in-memory hash table). This is easily fixed; if you have memory to spare, increase this setting:

SET max_memory_usage = 128000000000; -- 128 GB

 

If you don't have that much memory available, ClickHouse can "spill" this data to disk with the following settings:

SET max_bytes_before_external_group_by = 20000000000; -- 20 GB
SET max_memory_usage = 40000000000; -- 40 GB

According to the documentation, if you need max_bytes_before_external_group_by, it is recommended to set max_memory_usage to twice the value of max_bytes_before_external_group_by.

(The reason is that aggregation runs in two phases: 1. scanning the data and building intermediate state, and 2. merging that intermediate state. The "spill" to disk normally happens only during phase 1; if no spill occurred, ClickHouse may need the same amount of memory in phase 2 as in phase 1.)
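Putting the two settings together, a minimal sketch of retrying the failing top-paths query with spill-to-disk enabled might look like the following; the byte values are illustrative, simply following the 1:2 ratio recommended above, and would need tuning for a real machine:

```sql
-- Let GROUP BY state spill to disk once it reaches 20 GB,
-- and cap total per-query memory at twice that, per the documentation.
SET max_bytes_before_external_group_by = 20000000000; -- 20 GB
SET max_memory_usage = 40000000000;                   -- 40 GB

-- The same aggregation that previously failed with DB::Exception Code 241
SELECT
    path,
    count(*),
    sum(hits) / count(*) AS hit_ratio
FROM wikistat
WHERE project = 'en'
GROUP BY path
ORDER BY hit_ratio DESC
LIMIT 10;
```

Settings applied with SET are per-session, so the spill threshold only affects queries run on this connection.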

 

Performance comparison:

ClickHouse vs. Spark

ClickHouse and Spark are both distributed systems, but for ease of testing we ran a single-node test only. The results were striking.

Size / compression           Spark v. 2.0.2               ClickHouse
Storage format               Parquet, compressed: snappy  Internal storage, compressed
Size (uncompressed: 1.2TB)   395G                         212G

 


Test                          Spark v. 2.0.2           ClickHouse   Diff
Query 1: count (warm)         7.37 sec (no disk IO)    6.61 sec     ~same
Query 2: simple group (warm)  792.55 sec (no disk IO)  37.45 sec    21x better
Query 3: complex group by     2522.9 sec               398.55 sec   6.3x better

 

Appendix:

Hardware

  • CPU: 24xIntel(R) Xeon(R) CPU L5639 @ 2.13GHz (physical = 2, cores = 12, virtual = 24, hyperthreading = yes)
  • Disk: 2 consumer grade SSD in software RAID 0 (mdraid)

Query 1

select count(*) from wikistat

ClickHouse:

:)  select count(*) from wikistat;
 
SELECT count(*)
FROM wikistat 
 
┌─────count()─┐
│ 26935251789 │
└─────────────┘
 
1 rows in set. Elapsed: 6.610 sec. Processed 26.88 billion rows, 53.77 GB (4.07 billion rows/s., 8.13 GB/s.) 

 

Spark:

spark-sql> select count(*) from wikistat;
26935251789
Time taken: 7.369 seconds, Fetched 1 row(s)

Query 2

select count(*), month(dt) as mon
from wikistat where year(dt)=2008
and month(dt) between 1 and 10
group by month(dt)
order by month(dt);

ClickHouse:

:) select count(*), toMonth(date) as mon from wikistat 
where toYear(date)=2008 and toMonth(date) between 1 and 10 group by mon;
 
SELECT 
    count(*), 
    toMonth(date) AS mon
FROM wikistat 
WHERE (toYear(date) = 2008) AND ((toMonth(date) >= 1) AND (toMonth(date) <= 10))
GROUP BY mon
 
┌────count()─┬─mon─┐
│ 2100162604 │   1 │
│ 1969757069 │   2 │
│ 2081371530 │   3 │
│ 2156878512 │   4 │
│ 2476890621 │   5 │
│ 2526662896 │   6 │
│ 2489723244 │   7 │
│ 2480356358 │   8 │
│ 2522746544 │   9 │
│ 2614372352 │  10 │
└────────────┴─────┘
 
10 rows in set. Elapsed: 37.450 sec. Processed 23.37 billion rows, 46.74 GB (623.97 million rows/s., 1.25 GB/s.) 

Spark:

spark-sql> select count(*), month(dt) as mon from wikistat where year(dt)=2008 and month(dt) between 1 and 10 group by month(dt) order by month(dt);
2100162604      1                                                               
1969757069      2
2081371530      3
2156878512      4
2476890621      5
2526662896      6
2489723244      7
2480356358      8
2522746544      9
2614372352      10
Time taken: 792.552 seconds, Fetched 10 row(s)

 

Query 3

SELECT
path,
count(*),
sum(hits) AS sum_hits,
round(sum(hits) / count(*), 2) AS hit_ratio
FROM wikistat
WHERE project = 'en'
GROUP BY path
ORDER BY sum_hits DESC
LIMIT 100;

ClickHouse:

:) SELECT
:-]     path,
:-]     count(*),
:-]     sum(hits) AS sum_hits,
:-]     round(sum(hits) / count(*), 2) AS hit_ratio
:-] FROM wikistat
:-] WHERE (project = 'en')
:-] GROUP BY path
:-] ORDER BY sum_hits DESC
:-] LIMIT 100;
 
SELECT 
    path,
    count(*),
    sum(hits) AS sum_hits,
    round(sum(hits) / count(*), 2) AS hit_ratio
FROM wikistat
WHERE project = 'en'
GROUP BY path
ORDER BY sum_hits DESC
LIMIT 100
 
┌─path───────────────┬─count()─┬───sum_hits─┬─hit_ratio─┐
│ Special:Search     │   44795 │ 4544605711 │ 101453.41 │
│ Main_Page          │   31930 │ 2115896977 │  66266.74 │
│ Special:Random     │   30159 │  533830534 │  17700.54 │
│ Wiki               │   10237 │   40488416 │   3955.11 │
│ Special:Watchlist  │   38206 │   37200069 │    973.67 │
│ YouTube            │    9960 │   34349804 │   3448.78 │
│ Special:Randompage │    8085 │   28959624 │    3581.9 │
│ Special:AutoLogin  │   34413 │   24436845 │    710.11 │
│ Facebook           │    7153 │   18263353 │   2553.24 │
│ Wikipedia          │   23732 │   17848385 │    752.08 │
│ Barack_Obama       │   13832 │   16965775 │   1226.56 │
│ index.html         │    6658 │   16921583 │   2541.54 │
…
 
100 rows in set. Elapsed: 398.550 sec. Processed 26.88 billion rows, 1.24 TB (67.45 million rows/s., 3.10 GB/s.)

Spark:

spark-sql> SELECT
         >     path,
         >     count(*),
         >     sum(hits) AS sum_hits,
         >     round(sum(hits) / count(*), 2) AS hit_ratio
         > FROM wikistat
         > WHERE (project = en)
         > GROUP BY path
         > ORDER BY sum_hits DESC
         > LIMIT 100;
...
Time taken: 2522.903 seconds, Fetched 100 row(s)

 

 

References:

https://www.percona.com/blog/2017/02/13/clickhouse-new-opensource-columnar-database/

https://clickhouse.yandex/docs/en/single/#ClickHouse%20features%20that%20can%20be%20considered%20disadvantages

 

 


