Time: 2021-07-01 10:21:17
Sentiment analysis on the IMDB data.
The dataset is downloaded through keras.datasets. Pay attention to the structure of the downloaded data, then preprocess it.
from keras.datasets import imdb
from keras import preprocessing

# Number of words to consider as features
max_features = 10000
# Cut texts after this number of words
# (among top max_features most common words)
maxlen = 20

# Load the data as lists of integers.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train: one list of integer word indices per review
y_train: binary labels, 0 and 1
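To make the effect of num_words=max_features more concrete, here is a minimal pure-Python sketch of the capping step. It assumes the Keras default in which words outside the most frequent num_words are replaced by an out-of-vocabulary index (oov_char=2). The helper name cap_vocabulary and the toy review are hypothetical, and this is not the library's own code.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    """Replace every word index >= num_words with oov_char (illustrative helper)."""
    return [w if w < num_words else oov_char for w in sequence]


# A toy "review" encoded as word indices; lower indices mean more frequent words.
toy_review = [1, 14, 22, 16, 12045, 43, 530, 25000]
capped = cap_vocabulary(toy_review, num_words=10000)
print(capped)  # [1, 14, 22, 16, 2, 43, 530, 2]
```

After this step every index is below max_features, which is what lets the data index into a vocabulary of exactly that size.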
The reviews differ in length, so they have to be padded or truncated to a common length.
# This turns our lists of integers
# into a 2D integer tensor of shape `(samples, maxlen)`
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
The length is set to maxlen=20.
The resulting matrix can be fed directly to an Embedding layer as input.
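As a rough illustration of why a (samples, maxlen) integer matrix suits an Embedding layer, the sketch below mimics the lookup in pure Python: each word index selects one row of a table. The embedding size of 8, the random table, and the helper name embed are assumptions made for this example; a real Embedding layer learns the table during training.

```python
import random

random.seed(0)

max_features = 10000   # vocabulary size, as in the text
maxlen = 20            # padded review length, as in the text
embedding_dim = 8      # assumption: not specified in the text

# One vector per word index (random here, learned in a real model).
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(max_features)]


def embed(batch):
    """Map a (samples, maxlen) batch of indices to (samples, maxlen, embedding_dim)."""
    return [[table[index] for index in row] for row in batch]


# Two fake padded reviews; every index must be below max_features.
batch = [[random.randrange(max_features) for _ in range(maxlen)] for _ in range(2)]
vectors = embed(batch)
print(len(vectors), len(vectors[0]), len(vectors[0][0]))  # 2 20 8
```

The lookup only works because every row has the same length and every index is a valid row of the table, which is what the padding step and num_words provide.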
Reference: padding sequences with pad_sequences
keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.)
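Converts a list of nb_samples sequences (sequences of scalars) into a 2D numpy array of shape (nb_samples, nb_timesteps). If maxlen is given, nb_timesteps=maxlen; otherwise it is the length of the longest sequence. Sequences shorter than this length are padded with 0 until they reach it, at the start by default. Sequences longer than nb_timesteps are truncated to fit the target length. Where padding and truncation take place is controlled by padding and truncating respectively.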
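sequences: a two-level nested list of floats or integers
maxlen: None or an integer, the maximum sequence length. Longer sequences are truncated; shorter ones are padded with 0 (at the start unless padding='post').
dtype: data type of the returned numpy array
padding: 'pre' or 'post', whether to pad at the start or at the end of a sequence
truncating: 'pre' or 'post', whether to truncate from the start or from the end of a sequence
value: float, used for padding in place of the default value 0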
Returns a 2D tensor of shape (nb_samples, nb_timesteps).
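The padding and truncation rules described above can be sketched in a few lines of plain Python. This is an illustrative re-implementation under the stated defaults, not the Keras source; the name pad_sequences_sketch is hypothetical, and it returns nested lists instead of a numpy array.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding="pre",
                         truncating="pre", value=0):
    """Pure-Python illustration of the pad/truncate rules (not the Keras code)."""
    if maxlen is None:
        # Without maxlen, the longest sequence sets the target length.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the start, 'post' drops them from the end.
            if truncating == "pre":
                seq = seq[len(seq) - maxlen:]
            else:
                seq = seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill values in front, 'post' puts them behind.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


rows = [[2, 3], [3, 4, 6], [7, 8, 9, 10]]
default_result = pad_sequences_sketch(rows, maxlen=3)
post_result = pad_sequences_sketch(rows, maxlen=3, padding="post", truncating="post")
print(default_result)  # [[0, 2, 3], [3, 4, 6], [8, 9, 10]]
print(post_result)     # [[2, 3, 0], [3, 4, 6], [7, 8, 9]]
```

With the defaults, a long sequence keeps its last maxlen items, so with maxlen=20 only the final 20 words of each review would survive.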
>>> import numpy as np
>>> # dtype=object is needed for ragged lists on recent NumPy versions
>>> a = np.array([[2, 3], [3, 4, 6], [7, 8, 9, 10]], dtype=object)
>>> a
array([list([2, 3]), list([3, 4, 6]), list([7, 8, 9, 10])], dtype=object)
>>> import keras
Using TensorFlow backend.
>>> # Default: pad at the start
>>> b = keras.preprocessing.sequence.pad_sequences(a, maxlen=10)
>>> b
array([[ 0,  0,  0,  0,  0,  0,  0,  0,  2,  3],
       [ 0,  0,  0,  0,  0,  0,  0,  3,  4,  6],
       [ 0,  0,  0,  0,  0,  0,  7,  8,  9, 10]])
>>> # Pad at the end
>>> c = keras.preprocessing.sequence.pad_sequences(a, maxlen=10, padding='post')
>>> c
array([[ 2,  3,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 3,  4,  6,  0,  0,  0,  0,  0,  0,  0],
       [ 7,  8,  9, 10,  0,  0,  0,  0,  0,  0]])
>>> # Pad at the end, truncate from the start (default truncating='pre')
>>> d = keras.preprocessing.sequence.pad_sequences(a, maxlen=3, padding='post')
>>> d
array([[ 2,  3,  0],
       [ 3,  4,  6],
       [ 8,  9, 10]])
>>> # Defaults: pad at the start, truncate from the start
>>> e = keras.preprocessing.sequence.pad_sequences(a, maxlen=3)
>>> e
array([[ 0,  2,  3],
       [ 3,  4,  6],
       [ 8,  9, 10]])
>>> # Pad at the end, truncate from the end
>>> f = keras.preprocessing.sequence.pad_sequences(a, maxlen=3, padding='post', truncating='post')
>>> f
array([[2, 3, 0],
       [3, 4, 6],
       [7, 8, 9]])
[506] Reading and preprocessing the IMDB dataset with keras