
Paper Reading_Database and Storage System

Date: 2021-07-01 10:21:17

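Now it is time to use an ML model to predict, that is, to estimate the number of queries that the application will execute in the future. This is the part we care about most, because NoSQL carries relatively little query semantics and does not need the first two steps.

We look at the Forecast section on its own. The paper examines the following models: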

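Linear Regression (LR): the simple option. Its advantages are that it is simple, resists overfitting, and needs relatively little training data.
Kernel Regression (KR): compared with Linear Regression, kernel methods can fit some nonlinear functions. It looks powerful, but it is also more complex, needs more training data, and overfits more easily. In the authors' experiments, LR predicts better over short horizons (for example 1 hour), while KR predicts better over long horizons (for example one day). There are many kinds of kernel methods; this paper uses Nadaraya–Watson kernel regression, which is a non-parametric method.
RNN: in practice this is an LSTM. An LSTM can "remember" features of the data from past time periods and apply them to predicting the future. Its drawbacks are again that the model is complex and needs many samples. (So in ML a more complex model is not automatically better; many tradeoffs have to be made for the situation at hand.)
ENSEMBLE: ensemble learning combines the strengths of different models. In this paper the authors ensemble LR and RNN into a new model (equally averaging the prediction results of the LR and RNN models), which we call ENSEMBLE here.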
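To make the KR entry above more concrete, here is a minimal sketch of a Nadaraya–Watson estimator in plain Python. The Gaussian kernel, the `bandwidth` parameter, and the function names are assumptions made for illustration; the notes do not say which kernel or bandwidth the paper uses.

```python
import math

def gaussian_kernel(u, bandwidth):
    """Gaussian weight for a distance u; it shrinks quickly as |u| grows."""
    return math.exp(-0.5 * (u / bandwidth) ** 2)

def nadaraya_watson(x_query, xs, ys, bandwidth=1.0):
    """Predict y at x_query as a kernel-weighted average of the training ys."""
    weights = [gaussian_kernel(x_query - x, bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # Every training point is too far away for float precision:
        # fall back to the plain mean (an assumption of this sketch).
        return sum(ys) / len(ys)
    return sum(w * y for w, y in zip(weights, ys)) / total

# Toy data (hypothetical): one spike at x = 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [10.0, 12.0, 50.0, 11.0]
print(nadaraya_watson(2.0, xs, ys, bandwidth=0.3))
```

There are no fitted coefficients: the training points themselves are the model, which is what "non-parametric" means here. It is also why KR needs training inputs close to the query point to predict well.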

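In their experiments the authors found:

1). ENSEMBLE has better average prediction accuracy, but it cannot predict workloads with periodic spikes. For example, a university application website sees a peak before each year's application deadline but very little traffic the rest of the time (see Fig1b).
2). KR's average prediction accuracy is lower than ENSEMBLE's, but it is the only model that can predict workloads with spikes.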

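Based on these results, the authors propose a HYBRID model that automatically decides which of the models above to use in each situation. For the spike feature the rule can be designed as follows: if KR's prediction (its predicted workload volume) is larger than ENSEMBLE's by more than K% (K is a threshold, set to 150% here), use KR's result; otherwise use ENSEMBLE's result.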
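The selection rule can be sketched as follows. The function names are hypothetical, and the reading of "larger by 150%" as "KR exceeds ENSEMBLE by more than 1.5 times the ENSEMBLE value" is an assumption; the paper may define the comparison differently.

```python
def ensemble_prediction(lr_pred, rnn_pred):
    """ENSEMBLE: equal average of the LR and RNN predictions."""
    return (lr_pred + rnn_pred) / 2.0

def hybrid_prediction(lr_pred, rnn_pred, kr_pred, threshold=1.5):
    """HYBRID: use KR only when it predicts a much larger volume than ENSEMBLE."""
    ens = ensemble_prediction(lr_pred, rnn_pred)
    # Assumed interpretation: KR is above ENSEMBLE by more than threshold * ENSEMBLE.
    if kr_pred - ens > threshold * ens:
        return kr_pred
    return ens

print(hybrid_prediction(100.0, 120.0, 130.0))  # KR is close to ENSEMBLE, so ENSEMBLE is used
print(hybrid_prediction(100.0, 120.0, 400.0))  # KR predicts a spike, so KR is used
```

The rule relies on ENSEMBLE for ordinary traffic, where its average accuracy is better, and switches to KR only when KR signals a spike.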

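A few details remain. Two of them are the Prediction Horizon (how far into the future the workload is predicted, comparable to the range of the x axis in a regression plot) and the Prediction Interval (the length of time over which each model output counts queries, comparable to the unit of each point on the x axis). The authors set the interval to one minute, so for every minute within the horizon the model predicts the arrival rate of queries in that minute. (See Section 6.2 for further details.)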
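As a rough illustration of how interval and horizon interact, the sketch below buckets query timestamps into fixed intervals and then builds (past window, future value) training pairs. The function names, the `window` parameter, and the sample timestamps are assumptions for illustration, not the paper's actual pipeline.

```python
from collections import Counter

def arrival_rates(timestamps, interval_s=60):
    """Count queries per interval, as a dense list from the first to the last bucket."""
    if not timestamps:
        return []
    buckets = Counter(int(t // interval_s) for t in timestamps)
    first, last = min(buckets), max(buckets)
    return [buckets.get(b, 0) for b in range(first, last + 1)]

def make_samples(rates, window, horizon):
    """Pair each run of `window` past rates with the rate `horizon` intervals ahead."""
    samples = []
    for end in range(window, len(rates) - horizon + 1):
        features = rates[end - window:end]
        target = rates[end + horizon - 1]
        samples.append((features, target))
    return samples

# Hypothetical query arrival times in seconds.
rates = arrival_rates([0, 10, 59, 61, 200], interval_s=60)
print(rates)
print(make_samples(rates, window=2, horizon=1))
```

A longer horizon moves the target further from the window and leaves fewer usable pairs; a longer interval shortens the `rates` series itself.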

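Finally, the experiments. For clustering, the conclusion is that selecting the top 5 clusters is enough; for training, these top 5 clusters are then mixed together as the training data and only one model is trained. We focus on the Forecast results. In the experiments below, the models are trained on the past 3 weeks of data and then evaluated with different Prediction Horizons.

For model selection, in theory one should avoid models that are overly sensitive to their hyperparameters, because fine-tuning a model's hyperparameters is by itself a hard optimization task. The experiments confirm this (Fig7; the Prediction Interval used there does not seem to be stated, but it is discussed separately later):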

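When the horizon is short (within one day), LR performs even better than RNN, because over short periods the relationship between the arrival rate observed in the recent past and the arrival rate in the near future is more linear than for longer horizons. A more complex model may instead overfit.
When the horizon is long (one day or more), the relationship between the past and the future also grows in complexity, and RNN performs better.
ENSEMBLE, which combines the two models above, has the best overall performance.
Only HYBRID can predict workload patterns with spikes.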

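The authors also chose the following models for comparison:

Autoregressive Moving Average (ARMA): a time-series model (time-series methods are a direction worth considering in their own right). Its results are not very stable, however, and it performs poorly in 38% of the cases. This is because the model is sensitive to its hyperparameters. The optimal hyperparameter settings for ARMA are highly dependent on the statistical properties of the data, such as stationarity and the autocorrelation structure.
Feed-forward Neural Network (FNN): an ordinary multi-layer neural network. It is clearly outperformed by the RNN.
Predictive State Recurrent Neural Network (PSRNN): another RNN variant. Even so, it is still clearly outperformed by the LSTM here.
Kernel Regression (KR): on its own, KR gives only mediocre results, because it is prone to error when it has not seen inputs in training that are close to the input to make the prediction with. It does turn out to be needed later, though, for predicting spikes.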

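For workload patterns with spikes, the authors use a 1-hour interval, train on the full workload history, and try to predict whether a spike will appear in the workload one week later. In this setting only KR successfully predicts the spike (Fig9). This is because its prediction is based on the distance between the test points and training data, where the influence of each training data point decreases exponentially with its distance from the test point.

For the choice of Prediction Horizon, tests on the BusTracker data show that shorter horizons still give better results; for example, 1 hour is better than 1 week (Fig8).

For the choice of Prediction Interval, shorter intervals tend to work better (shorter intervals provide more training samples and better information for learning), but an interval that is too small also brings more noise, a more complex model, and longer training, so this is again a tradeoff. In the end the experiments show that a 1-hour interval works well overall. (Automatically setting the interval is one piece of future work.)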

Links:

https://zhuanlan.zhihu.com/p/37182849

https://github.com/pentium3/QueryBot5000

Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook

This paper describes how to analyze workloads from real deployments, using three typical workloads (UDB, ZippyDB, UP2X) as examples.

We have previously run benchmarks with tools such as YCSB, but YCSB supports only a limited set of key-value distributions [ YCSB-generated workloads ignore key space localities. In YCSB, hot KV-pairs are either randomly allocated across the whole key-space or clustered together. This results in an I/O mismatch between accessed data blocks in storage and the data blocks associated with KV queries. ], so it is hard to simulate real conditions accurately. This paper instead traces real production data, then replays and analyzes it. During the analysis the focus is on which key-ranges the hot data falls into [ The whole key space is partitioned into small key-ranges, and we model the hotness of these small key-ranges. ], looking for patterns related to the business scenario. Then, when the benchmark is designed, queries are assigned to key-ranges based on the distribution of key-range hotness, and hot keys are allocated closely in each key-range.
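The key-range idea can be sketched as a two-step sampler: pick a key-range in proportion to its hotness, then pick a key inside that range with a bias toward a contiguous hot region. Everything here is a hypothetical illustration, including the function names, the `hot_fraction` parameter, the 90% bias, and the hotness values; it is not the paper's actual generator.

```python
import random

def build_key_ranges(num_keys, range_size):
    """Partition keys 0..num_keys-1 into consecutive half-open ranges [start, end)."""
    return [(start, min(start + range_size, num_keys))
            for start in range(0, num_keys, range_size)]

def sample_key(key_ranges, hotness, rng, hot_fraction=0.1):
    """Pick a key-range by hotness, then a key inside it, favoring the front of the range."""
    start, end = rng.choices(key_ranges, weights=hotness, k=1)[0]
    size = end - start
    hot_size = max(1, int(size * hot_fraction))
    # Assumption of this sketch: 90% of accesses in a range hit its first
    # hot_fraction of keys, so hot keys sit next to each other.
    if rng.random() < 0.9:
        return start + rng.randrange(hot_size)
    return start + rng.randrange(size)

rng = random.Random(0)                # fixed seed so runs are repeatable
ranges = build_key_ranges(100, 25)    # four key-ranges of 25 keys each
hotness = [0.7, 0.1, 0.15, 0.05]      # hypothetical per-range hotness
print([sample_key(ranges, hotness, rng) for _ in range(10)])
```

Compared with scattering hot keys uniformly over the whole key space, this keeps hot keys in neighboring positions, which is the locality the notes say YCSB-generated workloads miss.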

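Trace and Replay are mostly engineering work and are not really the focus of the paper. Chapter 3 describes the metrics that need to be recorded during this process.

What follows are the results of the analysis:

ch4.1 The proportion of each query type (get/put/...)
ch4.2 The hotness distribution of KV pairs
ch4.3 How QPS changes over time, which shows the traffic in different periods
ch5 Key and value sizes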
ch6

After the analysis, the authors explore whether a benchmark tool can generate a workload that resembles the real one as closely as possible.

Blockchain

....

Spatial DB

...

Transaction

...

Query Optimizer

...

GraphDB

...

Security / Privacy

...

....
