Time: 2021-07-01 10:21:17
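- <code class="language-python hljs "><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
- <span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
- amazon = pd.read_csv(<span class="hljs-string">"C:/Users/cs/Desktop/Amazon/train.csv"</span>)
- data = amazon  <span class="hljs-comment"># alias of the same DataFrame, not a copy</span>
- data.head()</code>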
| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | ROLE_FAMILY | ROLE_CODE |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 290919 | 117908 |
1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 308574 | 118539 |
2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 19721 | 117880 |
3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 290919 | 118322 |
4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 19793 | 119325 |
The training set contains 32,769 samples and has no missing values.
- <code class="language-python hljs ">data.info()</code>
- <code class="language-python hljs ">data.describe()</code>
| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | ROLE_FAMILY | ROLE_CODE |
---|---|---|---|---|---|---|---|---|---|---|
count | 32769.000000 | 32769.000000 | 32769.000000 | 32769.000000 | 32769.000000 | 32769.000000 | 32769.000000 | 32769.000000 | 32769.000000 | 32769.000000 |
mean | 0.942110 | 42923.916171 | 25988.957979 | 116952.627788 | 118301.823156 | 118912.779914 | 125916.152644 | 170178.369648 | 183703.408893 | 119789.430132 |
std | 0.233539 | 34173.892702 | 35928.031650 | 10875.563591 | 4551.588572 | 18961.322917 | 31036.465825 | 69509.462130 | 100488.407413 | 5784.275516 |
min | 0.000000 | 0.000000 | 25.000000 | 4292.000000 | 23779.000000 | 4674.000000 | 117879.000000 | 4673.000000 | 3130.000000 | 117880.000000 |
25% | 1.000000 | 20299.000000 | 4566.000000 | 117961.000000 | 118102.000000 | 118395.000000 | 118274.000000 | 117906.000000 | 118363.000000 | 118232.000000 |
50% | 1.000000 | 35376.000000 | 13545.000000 | 117961.000000 | 118300.000000 | 118921.000000 | 118568.000000 | 128696.000000 | 119006.000000 | 118570.000000 |
75% | 1.000000 | 74189.000000 | 42034.000000 | 117961.000000 | 118386.000000 | 120535.000000 | 120006.000000 | 235280.000000 | 290919.000000 | 119348.000000 |
max | 1.000000 | 312153.000000 | 311696.000000 | 311178.000000 | 286791.000000 | 286792.000000 | 311867.000000 | 311867.000000 | 308574.000000 | 270691.000000 |
Next, count the distinct IDs in each variable. Among the roughly 30,000 samples, RESOURCE, MGR_ID, and ROLE_FAMILY_DESC have many distinct IDs, while the other variables have relatively few.
Notably, ROLE_TITLE and ROLE_CODE have the same number of distinct IDs (343).
- <code class="language-python hljs ">f = <span class="hljs-keyword">lambda</span> x: x.unique().size
- data.apply(f)</code>
- <code>ACTION                 2
- RESOURCE            7518
- MGR_ID              4243
- ROLE_ROLLUP_1        128
- ROLE_ROLLUP_2        177
- ROLE_DEPTNAME        449
- ROLE_TITLE           343
- ROLE_FAMILY_DESC    2358
- ROLE_FAMILY           67
- ROLE_CODE            343
- dtype: int64
- </code>
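pandas also provides `DataFrame.nunique()`, which counts the distinct values in each column directly. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset) to show that it gives the same kind of result as the lambda above.

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "ROLE_TITLE": [10, 10, 20, 30],
    "ROLE_CODE": [1, 1, 2, 3],
    "ACTION": [1, 0, 1, 1],
})

# Same idea as data.apply(lambda x: x.unique().size), using the built-in method.
unique_counts = toy.nunique()

# The lambda-based version, for comparison.
unique_counts_lambda = toy.apply(lambda x: x.unique().size)

print(unique_counts)
```

Both approaches return one count per column; `nunique()` ignores missing values by default, which makes no difference here because the data has none.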
A scatter plot of ROLE_TITLE against ROLE_CODE shows a clear positive relationship between the two ID columns.
- <code class="language-python hljs ">fig,ax = plt.subplots(nrows=<span class="hljs-number">1</span>,ncols=<span class="hljs-number">1</span>,figsize=(<span class="hljs-number">8</span>,<span class="hljs-number">5</span>))
- plt.scatter(data.ROLE_TITLE,data.ROLE_CODE)</code>
Combining the two variables into a single key still yields 343 distinct values:
- <code class="language-python hljs ">TITLE_CODE = data.ROLE_TITLE*<span class="hljs-number">1000000</span>+data.ROLE_CODE
- TITLE_CODE.unique().size</code>
- <code>343
- </code>
- <code class="language-python hljs "><span class="hljs-comment"># Define f2 to count the non-zero entries in each row or column of a crosstab</span>
- f2 = <span class="hljs-keyword">lambda</span> x: x[x!=<span class="hljs-number">0</span>].count()
- <span class="hljs-comment"># Build the crosstab of the two variables</span>
- TICO = pd.crosstab(data.ROLE_TITLE,data.ROLE_CODE)
- <span class="hljs-comment"># Count how many ROLE_TITLE values each ROLE_CODE value maps to</span>
- TICO.apply(f2).plot()
- <span class="hljs-comment"># The result is empty: no ROLE_CODE value maps to more than one ROLE_TITLE, so every ROLE_CODE has exactly one ROLE_TITLE</span>
- TICO.apply(f2)[TICO.apply(f2)><span class="hljs-number">1</span>]</code>
- <code>Series([], dtype: int64)
- </code>
- <code class="language-python hljs "><span class="hljs-comment"># Count how many ROLE_CODE values each ROLE_TITLE value maps to; again none exceeds one, so the two variables are in one-to-one correspondence</span>
- TICO.apply(f2,axis=<span class="hljs-number">1</span>).plot()
- TICO.apply(f2,axis=<span class="hljs-number">1</span>)[TICO.apply(f2,axis=<span class="hljs-number">1</span>)><span class="hljs-number">1</span>]</code>
- <code>Series([], dtype: int64)
- </code>
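The same one-to-one check can be expressed without a crosstab by grouping on one column and counting the distinct values of the other. This is a sketch under assumptions: the helper name `is_one_to_one` and the `toy` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def is_one_to_one(df, left, right):
    """Return True when each `left` value pairs with exactly one `right` value and vice versa."""
    left_to_right = df.groupby(left)[right].nunique()
    right_to_left = df.groupby(right)[left].nunique()
    return bool((left_to_right == 1).all() and (right_to_left == 1).all())

# Made-up IDs: title and code pair up one-to-one, while family 7 covers two titles.
toy = pd.DataFrame({
    "title":  [10, 10, 20, 30],
    "code":   [1, 1, 2, 3],
    "family": [7, 7, 7, 8],
})

title_code_one_to_one = is_one_to_one(toy, "title", "code")
title_family_one_to_one = is_one_to_one(toy, "title", "family")
print(title_code_one_to_one, title_family_one_to_one)
```

Grouping avoids building a dense table, which can matter when both columns have thousands of distinct IDs.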
- <code class="language-python hljs "><span class="hljs-comment"># Combining ROLE_ROLLUP_1 and ROLE_DEPTNAME changes the number of distinct IDs considerably (1185 pairs versus 128 and 449), although some correspondence remains</span>
- RO1_DEP= data.ROLE_ROLLUP_1*<span class="hljs-number">10000000</span>+data.ROLE_DEPTNAME
- data.ROLE_ROLLUP_1.unique().size, data.ROLE_DEPTNAME.unique().size, RO1_DEP.unique().size
- <span class="hljs-comment"># ctRO1DEP = pd.crosstab(data.ROLE_ROLLUP_1,data.ROLE_DEPTNAME)</span></code>
- <code>(128, 449, 1185)
- </code>
- <code class="language-python hljs "><span class="hljs-comment"># Combining ROLE_ROLLUP_2 and ROLE_DEPTNAME also changes the number of distinct IDs considerably (1398 pairs versus 177 and 449), although some correspondence remains</span>
- RO2_DEP= data.ROLE_ROLLUP_2*<span class="hljs-number">10000000</span>+data.ROLE_DEPTNAME
- data.ROLE_ROLLUP_2.unique().size, data.ROLE_DEPTNAME.unique().size, RO2_DEP.unique().size</code>
- <code>(177, 449, 1398)
- </code>
- <code class="language-python hljs "><span class="hljs-comment"># Combining ROLE_ROLLUP_1 and ROLE_ROLLUP_2 barely changes the number of distinct IDs (187 pairs versus 177), which indicates a strong correspondence between them</span>
- RO1_RO2= data.ROLE_ROLLUP_1*<span class="hljs-number">10000000</span>+data.ROLE_ROLLUP_2
- data.ROLE_ROLLUP_1.unique().size, data.ROLE_ROLLUP_2.unique().size, RO1_RO2.unique().size</code>
- <code>(128, 177, 187)
- </code>
- <code class="language-python hljs "><span class="hljs-comment"># Build the crosstab of the two variables</span>
- ctRO12 = pd.crosstab(data.ROLE_ROLLUP_1,data.ROLE_ROLLUP_2)
- <span class="hljs-comment"># Count how many ROLE_ROLLUP_1 values each ROLE_ROLLUP_2 value maps to</span>
- ctRO12.apply(f2).plot()
- <span class="hljs-comment"># Only three ROLE_ROLLUP_2 values map to more than one ROLE_ROLLUP_1 value, so the relationship is very close to one-to-many from ROLE_ROLLUP_1 to ROLE_ROLLUP_2</span>
- ctRO12.apply(f2)[ctRO12.apply(f2)><span class="hljs-number">1</span>]</code>
- <code>ROLE_ROLLUP_2
- 118164 2
- 118178 2
- 119256 9
- dtype: int64
- </code>
- <code class="language-python hljs "><span class="hljs-comment"># Count the samples whose ROLE_ROLLUP_2 ID is 118164, 118178, or 119256. There are only 416 of them, but their denial rate (36 of 416, about 8.7%) is above the overall rate (about 5.8%)</span>
- a = data.ROLE_ROLLUP_2[(data.ROLE_ROLLUP_2==<span class="hljs-number">118164</span>) | (data.ROLE_ROLLUP_2==<span class="hljs-number">118178</span>)| (data.ROLE_ROLLUP_2==<span class="hljs-number">119256</span>)].count()
- b = data.ACTION[(data.ROLE_ROLLUP_2==<span class="hljs-number">118164</span>) | (data.ROLE_ROLLUP_2==<span class="hljs-number">118178</span>)| (data.ROLE_ROLLUP_2==<span class="hljs-number">119256</span>)].value_counts()
- b,a</code>
- <code>(1    380
-  0     36
-  Name: ACTION, dtype: int64, 416)
- </code>
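The comparison between the denial rate of a few selected IDs and the overall denial rate can be written more compactly with `Series.isin`. The following is a minimal sketch on invented data; `flagged_ids` and the `toy` values are assumptions for illustration only.

```python
import pandas as pd

# Made-up data: ACTION is 1 for an approved request and 0 for a denied one.
toy = pd.DataFrame({
    "ROLE_ROLLUP_2": [1, 1, 2, 2, 3, 3, 3, 3],
    "ACTION":        [1, 0, 1, 1, 1, 1, 1, 0],
})

flagged_ids = [1]  # hypothetical IDs of interest
mask = toy["ROLE_ROLLUP_2"].isin(flagged_ids)

# The mean of a 0/1 column is the approval rate, so 1 - mean is the denial rate.
subset_denial_rate = 1 - toy.loc[mask, "ACTION"].mean()
overall_denial_rate = 1 - toy["ACTION"].mean()
print(subset_denial_rate, overall_denial_rate)
```

With only a few hundred samples in the selected group, a difference of this kind is suggestive rather than conclusive.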
- <code class="language-python hljs "><span class="hljs-comment"># Count how many ROLE_ROLLUP_2 values each ROLE_ROLLUP_1 value maps to</span>
- <span class="hljs-comment"># ctRO12.apply(f2,axis=1).plot()</span>
- <span class="hljs-comment"># 32 ROLE_ROLLUP_1 values map to more than one ROLE_ROLLUP_2 value</span>
- ctRO12.apply(f2,axis=<span class="hljs-number">1</span>)[ctRO12.apply(f2,axis=<span class="hljs-number">1</span>)><span class="hljs-number">1</span>].count()</code>
- <code>32
- </code>
- <code class="language-python hljs "><span class="hljs-comment"># Combining ROLE_FAMILY_DESC and ROLE_FAMILY changes the number of distinct IDs only moderately (2586 pairs versus 2358), which indicates a strong correspondence between them</span>
- FA_DESC= data.ROLE_FAMILY_DESC*<span class="hljs-number">1000000</span>+data.ROLE_FAMILY
- data.ROLE_FAMILY_DESC.unique().size,data.ROLE_FAMILY.unique().size, FA_DESC.unique().size</code>
- <code>(2358, 67, 2586)
- </code>
- <code class="language-python hljs "><span class="hljs-comment"># Build the crosstab of the two variables</span>
- ctFAFA = pd.crosstab(data.ROLE_FAMILY,data.ROLE_FAMILY_DESC)
- <span class="hljs-comment"># 170 ROLE_FAMILY_DESC values map to more than one ROLE_FAMILY value;</span>
- <span class="hljs-comment"># 59 ROLE_FAMILY values map to more than one ROLE_FAMILY_DESC value, so the relationship is fairly close to one-to-many</span></code>
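The counts quoted in the comments above are not accompanied by the code that produced them. One way such counts could be obtained is sketched below; the helper `count_multi_mapped` and the `toy` frame are hypothetical, and this is not necessarily how the original figures were computed.

```python
import pandas as pd

def count_multi_mapped(df, key, other):
    """Count the `key` values that co-occur with more than one distinct `other` value."""
    per_key = df.groupby(key)[other].nunique()
    return int((per_key > 1).sum())

# Made-up IDs: family 1 has two descriptions, and description 10 appears in two families.
toy = pd.DataFrame({
    "family": [1, 1, 2, 2, 3],
    "desc":   [10, 11, 20, 20, 10],
})

families_with_many_descs = count_multi_mapped(toy, "family", "desc")
descs_with_many_families = count_multi_mapped(toy, "desc", "family")
print(families_with_many_descs, descs_with_many_families)
```

On the real data the equivalent calls would take `data` with the column names `ROLE_FAMILY` and `ROLE_FAMILY_DESC`.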
- <code class="language-python hljs "><span class="hljs-comment"># Combining ROLE_TITLE and ROLE_FAMILY leaves the number of distinct values unchanged (343), which suggests that each ROLE_TITLE belongs to a single ROLE_FAMILY</span>
- TIFA = data.ROLE_TITLE*<span class="hljs-number">1000000</span>+data.ROLE_FAMILY
- data.ROLE_TITLE.unique().size, data.ROLE_FAMILY.unique().size, TIFA.unique().size</code>
- <code>(343, 67, 343)
- </code>
- <code class="language-python hljs "><span class="hljs-comment"># Build the crosstab of the two variables</span>
- ctTIFA = pd.crosstab(data.ROLE_TITLE,data.ROLE_FAMILY)
- <span class="hljs-comment"># Count how many ROLE_FAMILY values each ROLE_TITLE value maps to</span>
- ctTIFA.apply(f2,axis=<span class="hljs-number">1</span>).plot()
- <span class="hljs-comment"># The count is 0: no ROLE_TITLE maps to more than one ROLE_FAMILY, so the relationship is one-to-many from ROLE_FAMILY to ROLE_TITLE</span>
- ctTIFA.apply(f2,axis=<span class="hljs-number">1</span>)[ctTIFA.apply(f2,axis=<span class="hljs-number">1</span>)><span class="hljs-number">1</span>].count()</code>
- <code>0
- </code>
- <code class="language-python hljs "><span class="hljs-comment"># Bar plot of ACTION: most requests were approved</span>
- fig,ax = plt.subplots(figsize=(<span class="hljs-number">8</span>,<span class="hljs-number">5</span>))
- data.ACTION.value_counts().plot(kind=<span class="hljs-string">"bar"</span>,color=<span class="hljs-string">"lightblue"</span>)
- ax.set_xticklabels((<span class="hljs-string">"Accessed"</span>,<span class="hljs-string">"Not Accessed"</span>), rotation= <span class="hljs-string">"horizontal"</span> )
- ax.set_title(<span class="hljs-string">"Bar plot of Action"</span>)</code>
- <code class="language-python hljs "><span class="hljs-comment"># Histograms of the remaining variables: RESOURCE and MGR_ID IDs mostly fall between 0 and 100000 and are fairly spread out, while the other variables are concentrated around a few values or ranges.</span>
- <span class="hljs-comment"># For example, 21407 samples have ROLE_ROLLUP_1 equal to 117961, and 10980 samples have ROLE_FAMILY equal to 290919.</span>
- <span class="hljs-comment"># data.ROLE_ROLLUP_1.value_counts(),data.ROLE_FAMILY.value_counts()</span>
- fig,ax = plt.subplots(nrows=<span class="hljs-number">4</span>,ncols=<span class="hljs-number">2</span>,figsize=(<span class="hljs-number">20</span>,<span class="hljs-number">40</span>))
- data.RESOURCE.hist(ax=ax[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>],bins=<span class="hljs-number">100</span>)
- ax[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">"Hist plot of RESOURCE"</span>)
- data.MGR_ID.hist(ax=ax[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>],bins=<span class="hljs-number">100</span>)
- ax[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>].set_title(<span class="hljs-string">"Hist plot of MGR_ID"</span>)
- data.ROLE_ROLLUP_1.hist(ax=ax[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>],bins=<span class="hljs-number">100</span>)
- ax[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">"Hist plot of ROLE_ROLLUP_1"</span>)
- data.ROLE_ROLLUP_2.hist(ax=ax[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>],bins=<span class="hljs-number">100</span>)
- ax[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>].set_title(<span class="hljs-string">"Hist plot of ROLE_ROLLUP_2"</span>)
- data.ROLE_DEPTNAME.hist(ax=ax[<span class="hljs-number">2</span>,<span class="hljs-number">0</span>],bins=<span class="hljs-number">100</span>)
- ax[<span class="hljs-number">2</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">"ROLE_DEPTNAME"</span>)
- data.ROLE_TITLE.hist(ax=ax[<span class="hljs-number">2</span>,<span class="hljs-number">1</span>],bins=<span class="hljs-number">100</span>)
- ax[<span class="hljs-number">2</span>,<span class="hljs-number">1</span>].set_title(<span class="hljs-string">"Hist plot of ROLE_TITLE"</span>)
- data.ROLE_FAMILY_DESC.hist(ax=ax[<span class="hljs-number">3</span>,<span class="hljs-number">0</span>],bins=<span class="hljs-number">100</span>)
- ax[<span class="hljs-number">3</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">"Hist plot of ROLE_FAMILY_DESC"</span>)
- data.ROLE_FAMILY.hist(ax=ax[<span class="hljs-number">3</span>,<span class="hljs-number">1</span>],bins=<span class="hljs-number">100</span>)
- ax[<span class="hljs-number">3</span>,<span class="hljs-number">1</span>].set_title(<span class="hljs-string">"Hist plot of ROLE_FAMILY"</span>)</code>
- <code class="language-python hljs "><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
- <span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
- <span class="hljs-comment"># Heatmap of the correlation matrix: the ID values show no obvious linear relationships</span>
- cm = np.corrcoef(data.values.T)
- sns.set(font_scale=<span class="hljs-number">1</span>)
- cols = data.columns
- hm = sns.heatmap(cm,
-                  cbar=<span class="hljs-keyword">True</span>,
-                  annot=<span class="hljs-keyword">True</span>,
-                  square=<span class="hljs-keyword">True</span>,
-                  fmt=<span class="hljs-string">'.2f'</span>,
-                  annot_kws={<span class="hljs-string">'size'</span>: <span class="hljs-number">10</span>},
-                  yticklabels=cols,
-                  xticklabels=cols)
- plt.tight_layout()
- plt.show()</code>
- <code class="language-python hljs "><span class="hljs-comment"># ROLE_CODE is one-to-one with ROLE_TITLE, and ROLE_FAMILY is one-to-many with ROLE_TITLE, so neither is assumed to carry additional information; drop both variables</span>
- data = amazon
- <span class="hljs-keyword">del</span> data[<span class="hljs-string">"ROLE_CODE"</span>]
- <span class="hljs-keyword">del</span> data[<span class="hljs-string">"ROLE_FAMILY"</span>]</code>
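Note that `data = amazon` binds a second name to the same DataFrame, so the `del` statements also remove the columns from `amazon`. The sketch below contrasts that behaviour with an explicit `.copy()`; the frame and its column names `A` and `B` are made up for illustration.

```python
import pandas as pd

# Case 1: plain assignment creates an alias, so deleting through it changes the original.
original = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
alias = original
del alias["B"]
alias_changed_original = "B" not in original.columns

# Case 2: .copy() creates an independent frame, so the original keeps its columns.
original = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
independent = original.copy()
del independent["B"]
copy_kept_original = "B" in original.columns

print(alias_changed_original, copy_kept_original)
```

If the untouched raw data is needed later, working on `amazon.copy()` would avoid having to read the CSV file again.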
- <code class="language-python hljs ">amazon = pd.read_csv(<span class="hljs-string">"C:/Users/cs/Desktop/Amazon/train.csv"</span>)</code>
- <code class="language-python hljs "><span class="hljs-comment"># Loop over the predictors, compute the relative frequency of each ID, and store it in a new column.</span>
- one= [<span class="hljs-string">"RESOURCE"</span>,<span class="hljs-string">"MGR_ID"</span>,<span class="hljs-string">"ROLE_ROLLUP_1"</span>,<span class="hljs-string">"ROLE_ROLLUP_2"</span>,<span class="hljs-string">"ROLE_DEPTNAME"</span>,<span class="hljs-string">"ROLE_TITLE"</span>,<span class="hljs-string">"ROLE_FAMILY_DESC"</span>]
- <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>,len(one)):
-     a=data[one[i]]
-     b=data[one[i]].value_counts()/<span class="hljs-number">32769</span>
-     a=a.map(b)
-     data[one[i]+<span class="hljs-string">"_prob"</span>]=a
- data.head()</code>
| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob | MGR_ID_prob | ROLE_ROLLUP_1_prob | ROLE_ROLLUP_2_prob | ROLE_DEPTNAME_prob | ROLE_TITLE_prob | ROLE_FAMILY_DESC_prob |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 0.000092 | 0.001678 | 0.653270 | 0.135006 | 0.002197 | 0.109341 | 0.210443 |
1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 0.000915 | 0.000305 | 0.653270 | 0.120388 | 0.004852 | 0.002472 | 0.000366 |
2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 0.000061 | 0.000092 | 0.005615 | 0.005615 | 0.016662 | 0.038329 | 0.001007 |
3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 0.000031 | 0.001892 | 0.653270 | 0.120388 | 0.005798 | 0.141872 | 0.037963 |
4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 0.000244 | 0.000275 | 0.008423 | 0.004211 | 0.001373 | 0.002289 | 0.000580 |
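The single-variable frequency encoding does not need a hard-coded sample count: `value_counts(normalize=True)` returns relative frequencies directly. A minimal sketch on a made-up column (the `toy` values are illustrative):

```python
import pandas as pd

# Made-up IDs: 5 appears twice, 7 and 9 once each.
toy = pd.DataFrame({"RESOURCE": [5, 5, 7, 9]})

# Relative frequency of each ID, indexed by the ID itself.
freq = toy["RESOURCE"].value_counts(normalize=True)

# Map every row's ID to its frequency, as the loop above does with a fixed divisor.
toy["RESOURCE_prob"] = toy["RESOURCE"].map(freq)
print(toy)
```

Because the divisor is derived from the data, the same code keeps working if rows are added or removed.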
- <code class="language-python hljs "><span class="hljs-comment"># Loop over every pair of predictors, compute the relative frequency of each ID pair, and store it in a new column.</span>
- two = [<span class="hljs-string">"RESOURCE"</span>,<span class="hljs-string">"MGR_ID"</span>,<span class="hljs-string">"ROLE_ROLLUP_1"</span>,<span class="hljs-string">"ROLE_ROLLUP_2"</span>,<span class="hljs-string">"ROLE_DEPTNAME"</span>,<span class="hljs-string">"ROLE_TITLE"</span>,<span class="hljs-string">"ROLE_FAMILY_DESC"</span>]
- <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>,len(two)):
-     <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> range(i+<span class="hljs-number">1</span>,len(two)):
-         a=data[two[i]]+data[two[j]]*<span class="hljs-number">1000000</span>
-         b=a.value_counts()/<span class="hljs-number">32769</span>
-         a=a.map(b)
-         data[two[i]+<span class="hljs-string">"_"</span>+two[j]+<span class="hljs-string">"_prob"</span>]=a
- data.head()</code>
| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob | MGR_ID_prob | … | ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob | ROLE_ROLLUP_1_ROLE_DEPTNAME_prob | ROLE_ROLLUP_1_ROLE_TITLE_prob | ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob | ROLE_ROLLUP_2_ROLE_DEPTNAME_prob | ROLE_ROLLUP_2_ROLE_TITLE_prob | ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob | ROLE_DEPTNAME_ROLE_TITLE_prob | ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob | ROLE_TITLE_ROLE_FAMILY_DESC_prob |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 0.000092 | 0.001678 | … | 0.135006 | 0.002014 | 0.089200 | 0.180659 | 0.002014 | 0.013855 | 0.033233 | 0.000671 | 0.001678 | 0.079557 |
1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 0.000915 | 0.000305 | … | 0.120388 | 0.003815 | 0.002472 | 0.000366 | 0.003754 | 0.000580 | 0.000153 | 0.000153 | 0.000153 | 0.000366 |
2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 0.000061 | 0.000092 | … | 0.005615 | 0.000397 | 0.001556 | 0.000061 | 0.000397 | 0.001556 | 0.000061 | 0.005615 | 0.000061 | 0.000061 |
3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 0.000031 | 0.001892 | … | 0.120388 | 0.005401 | 0.125057 | 0.036956 | 0.005035 | 0.022460 | 0.007782 | 0.003052 | 0.001770 | 0.016204 |
4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 0.000244 | 0.000275 | … | 0.004211 | 0.000671 | 0.000488 | 0.000305 | 0.000549 | 0.000244 | 0.000244 | 0.000183 | 0.000183 | 0.000519 |
5 rows × 36 columns
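The pair frequencies rely on packing two IDs into one integer key. An alternative that avoids choosing a multiplier is to group on both columns and broadcast the group size back to the rows. This is a sketch on invented data, with hypothetical column names `a` and `b`.

```python
import pandas as pd

# Made-up IDs: the pair (1, 10) occurs twice, the other pairs once.
toy = pd.DataFrame({
    "a": [1, 1, 1, 2],
    "b": [10, 10, 20, 10],
})

# Size of each (a, b) group, repeated on every row that belongs to it.
pair_counts = toy.groupby(["a", "b"])["a"].transform("size")

# Relative frequency of each row's pair.
toy["a_b_prob"] = pair_counts / len(toy)
print(toy)
```

The result lines up with the original rows, so no separate `map` step is needed.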
- <code class="language-python hljs "><span class="hljs-comment"># Loop over every triple of predictors, compute the relative frequency of each ID triple, and store it in a new column.</span>
- <span class="hljs-comment"># Note: the divisor here (91690) differs from the 32769 used in the other loops, which looks inconsistent; the values shown below were produced with 91690.</span>
- three = [<span class="hljs-string">"RESOURCE"</span>,<span class="hljs-string">"MGR_ID"</span>,<span class="hljs-string">"ROLE_ROLLUP_1"</span>,<span class="hljs-string">"ROLE_ROLLUP_2"</span>,<span class="hljs-string">"ROLE_DEPTNAME"</span>,<span class="hljs-string">"ROLE_TITLE"</span>,<span class="hljs-string">"ROLE_FAMILY_DESC"</span>]
- <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>,len(three)):
-     <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> range(i+<span class="hljs-number">1</span>,len(three)):
-         <span class="hljs-keyword">for</span> k <span class="hljs-keyword">in</span> range(j+<span class="hljs-number">1</span>,len(three)):
-             a = data[three[i]]*<span class="hljs-number">100000</span>*<span class="hljs-number">100000</span>+data[three[j]]*<span class="hljs-number">1000000</span>+data[three[k]]
-             b = a.value_counts()/<span class="hljs-number">91690</span>
-             a = a.map(b)
-             data[three[i]+<span class="hljs-string">"_"</span>+three[j]+<span class="hljs-string">"_"</span>+three[k]+<span class="hljs-string">"_"</span>+<span class="hljs-string">"prob"</span>]=a
- data.head()</code>
| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob | MGR_ID_prob | … | ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob | ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_TITLE_prob | ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob | ROLE_ROLLUP_1_ROLE_DEPTNAME_ROLE_TITLE_prob | ROLE_ROLLUP_1_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob | ROLE_ROLLUP_1_ROLE_TITLE_ROLE_FAMILY_DESC_prob | ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_TITLE_prob | ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob | ROLE_ROLLUP_2_ROLE_TITLE_ROLE_FAMILY_DESC_prob | ROLE_DEPTNAME_ROLE_TITLE_ROLE_FAMILY_DESC_prob |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 0.000092 | 0.001678 | … | 0.000720 | 0.004951 | 0.011877 | 0.000185 | 0.000556 | 0.023220 | 0.000185 | 0.000556 | 0.003937 | 0.000218 |
1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 0.000915 | 0.000305 | … | 0.001341 | 0.000207 | 0.000055 | 0.000055 | 0.000055 | 0.000131 | 0.000055 | 0.000055 | 0.000055 | 0.000055 |
2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 0.000061 | 0.000092 | … | 0.000142 | 0.000556 | 0.000022 | 0.000055 | 0.000022 | 0.000022 | 0.000055 | 0.000022 | 0.000022 | 0.000022 |
3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 0.000031 | 0.001892 | … | 0.001800 | 0.008027 | 0.002781 | 0.001091 | 0.000633 | 0.005682 | 0.000971 | 0.000534 | 0.001451 | 0.000545 |
4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 0.000244 | 0.000275 | … | 0.000196 | 0.000087 | 0.000087 | 0.000044 | 0.000044 | 0.000109 | 0.000022 | 0.000022 | 0.000087 | 0.000065 |
5 rows × 71 columns
- <code class="language-python hljs "><span class="hljs-comment"># Loop over every triple of the other predictors, compute how often it co-occurs with RESOURCE, and store the frequency in a new column.</span>
- <span class="hljs-comment"># Caution: the float term data[four[k]]*0.000001 is added to a key that reaches 1e14 and above for most RESOURCE IDs, where float64 spacing is far coarser than 1e-6, so distinct four[k] values can collapse into the same key.</span>
- four = [<span class="hljs-string">"RESOURCE"</span>,<span class="hljs-string">"MGR_ID"</span>,<span class="hljs-string">"ROLE_ROLLUP_1"</span>,<span class="hljs-string">"ROLE_ROLLUP_2"</span>,<span class="hljs-string">"ROLE_DEPTNAME"</span>,<span class="hljs-string">"ROLE_TITLE"</span>,<span class="hljs-string">"ROLE_FAMILY_DESC"</span>]
- <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,len(four)):
-     <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> range(i+<span class="hljs-number">1</span>,len(four)):
-         <span class="hljs-keyword">for</span> k <span class="hljs-keyword">in</span> range(j+<span class="hljs-number">1</span>,len(four)):
-             a = data[four[<span class="hljs-number">0</span>]]*<span class="hljs-number">100000</span>*<span class="hljs-number">100000</span>+data[four[i]]*<span class="hljs-number">1000000</span>+data[four[j]]+data[four[k]]*<span class="hljs-number">0.000001</span>
-             b = a.value_counts()/<span class="hljs-number">32769</span>
-             a = a.map(b)
-             data[four[<span class="hljs-number">0</span>]+<span class="hljs-string">"_"</span>+four[i]+<span class="hljs-string">"_"</span>+four[j]+<span class="hljs-string">"_"</span>+four[k]+<span class="hljs-string">"_"</span>+<span class="hljs-string">"prob"</span>]=a
- data.head()</code>
| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob | MGR_ID_prob | … | RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob | RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_TITLE_prob | RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob | RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_ROLE_TITLE_prob | RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob | RESOURCE_ROLE_ROLLUP_1_ROLE_TITLE_ROLE_FAMILY_DESC_prob | RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_TITLE_prob | RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob | RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_ROLE_FAMILY_DESC_prob | RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_ROLE_FAMILY_DESC_prob |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 0.000092 | 0.001678 | … | 0.000092 | 0.000092 | 0.000092 | 0.000031 | 0.000031 | 0.000061 | 0.000031 | 0.000031 | 0.000061 | 0.000031 |
1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 0.000915 | 0.000305 | … | 0.000336 | 0.000305 | 0.000153 | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.000031 |
2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 0.000061 | 0.000092 | … | 0.000061 | 0.000061 | 0.000031 | 0.000061 | 0.000031 | 0.000031 | 0.000061 | 0.000031 | 0.000031 | 0.000031 |
3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 0.000031 | 0.001892 | … | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.000031 |
4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 0.000244 | 0.000275 | … | 0.000061 | 0.000061 | 0.000061 | 0.000061 | 0.000061 | 0.000092 | 0.000031 | 0.000031 | 0.000061 | 0.000061 |
5 rows × 91 columns
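The four-variable key adds `data[four[k]]*0.000001` to a very large number, which turns the key into a float64. At that magnitude neighbouring fourth IDs can round to the same value. The sketch below demonstrates the collision with IDs of the same size as those in the first data row, then shows a grouping-based alternative on a made-up frame (the column names `r`, `m`, `t`, `d` are hypothetical).

```python
import pandas as pd

# Integer part of the key, built the same way as in the loop above.
base = 39353 * 100000 * 100000 + 85475 * 1000000 + 117961

# Two different fourth IDs, scaled by 1e-6 as in the original expression.
key_a = base + 117905 * 0.000001
key_b = base + 117906 * 0.000001

# float64 cannot tell these two keys apart at this magnitude.
keys_collide = key_a == key_b

# Alternative: group on the columns themselves, so no numeric packing is involved.
toy = pd.DataFrame({
    "r": [1, 1, 1],
    "m": [2, 2, 2],
    "t": [3, 3, 3],
    "d": [117905, 117906, 117905],
})
cols = ["r", "m", "t", "d"]
counts = toy.groupby(cols)["r"].transform("size")
toy["joint_prob"] = counts / len(toy)
print(keys_collide, counts.tolist())
```

Grouping keeps rows that differ only in the fourth column in separate groups, which the packed float key does not guarantee.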
- <code class="language-python hljs "><span class="hljs-comment"># Probability of each other single variable given RESOURCE: the joint frequency divided by the RESOURCE frequency</span>
- resourcetwo = [<span class="hljs-string">'RESOURCE_MGR_ID_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_ROLLUP_1_prob'</span>, <span class="hljs-string">'RESOURCE_ROLE_ROLLUP_2_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_DEPTNAME_prob'</span>,
-                <span class="hljs-string">'RESOURCE_ROLE_TITLE_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_FAMILY_DESC_prob'</span>]
- <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>,len(resourcetwo)):
-     a = data[resourcetwo[i]]/data.RESOURCE_prob
-     data[resourcetwo[i]+<span class="hljs-string">"_"</span>+<span class="hljs-string">"probre"</span>]=a
- data.head()</code>
| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob | MGR_ID_prob | … | RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_TITLE_prob | RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob | RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_ROLE_FAMILY_DESC_prob | RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_ROLE_FAMILY_DESC_prob | RESOURCE_MGR_ID_prob_probre | RESOURCE_ROLE_ROLLUP_1_prob_probre | RESOURCE_ROLE_ROLLUP_2_prob_probre | RESOURCE_ROLE_DEPTNAME_prob_probre | RESOURCE_ROLE_TITLE_prob_probre | RESOURCE_ROLE_FAMILY_DESC_prob_probre |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 0.000092 | 0.001678 | … | 0.000031 | 0.000031 | 0.000061 | 0.000031 | 1.000000 | 1.000000 | 1.000000 | 0.333333 | 0.666667 | 1.000000 |
1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 0.000915 | 0.000305 | … | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.033333 | 0.866667 | 0.366667 | 0.033333 | 0.033333 | 0.033333 |
2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 0.000061 | 0.000092 | … | 0.000061 | 0.000031 | 0.000031 | 0.000031 | 0.500000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.500000 |
3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 0.000031 | 0.001892 | … | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 0.000244 | 0.000275 | … | 0.000031 | 0.000031 | 0.000061 | 0.000061 | 0.250000 | 0.375000 | 0.250000 | 0.250000 | 0.500000 | 0.375000 |
5 rows × 97 columns
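Dividing a joint frequency by the RESOURCE frequency gives a conditional frequency: the share of a resource's rows that carry a given second ID. A minimal sketch on invented data (the `toy` values and the output column name are assumptions):

```python
import pandas as pd

# Made-up IDs: resource 1 has three rows, two of them with manager 10.
toy = pd.DataFrame({
    "RESOURCE": [1, 1, 1, 2],
    "MGR_ID":   [10, 10, 20, 10],
})
n = len(toy)

# Marginal frequency of each row's RESOURCE.
p_resource = toy.groupby("RESOURCE")["RESOURCE"].transform("size") / n

# Joint frequency of each row's (RESOURCE, MGR_ID) pair.
p_joint = toy.groupby(["RESOURCE", "MGR_ID"])["RESOURCE"].transform("size") / n

# Conditional frequency of MGR_ID given RESOURCE.
toy["p_mgr_given_resource"] = p_joint / p_resource
print(toy)
```

The ratio only stays within 0 and 1 when both frequencies use the same divisor, since the divisor then cancels out.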
- <code class="language-python hljs "><span class="hljs-comment"># Probability of RESOURCE given each other single variable: the joint frequency divided by the other variable's frequency</span>
- resourcetwo = [<span class="hljs-string">'RESOURCE_MGR_ID_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_ROLLUP_1_prob'</span>, <span class="hljs-string">'RESOURCE_ROLE_ROLLUP_2_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_DEPTNAME_prob'</span>,
-                <span class="hljs-string">'RESOURCE_ROLE_TITLE_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_FAMILY_DESC_prob'</span>]
- resourceone = [ <span class="hljs-string">'MGR_ID_prob'</span>, <span class="hljs-string">'ROLE_ROLLUP_1_prob'</span>,<span class="hljs-string">'ROLE_ROLLUP_2_prob'</span>, <span class="hljs-string">'ROLE_DEPTNAME_prob'</span>, <span class="hljs-string">'ROLE_TITLE_prob'</span>,<span class="hljs-string">'ROLE_FAMILY_DESC_prob'</span>]
- <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>,len(resourcetwo)):
-     a = data[resourcetwo[i]]/data[resourceone[i]]
-     data[resourcetwo[i]+<span class="hljs-string">"_"</span>+<span class="hljs-string">"proboth"</span>]=a
- data.head()</code>
| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob | MGR_ID_prob | … | RESOURCE_ROLE_ROLLUP_2_prob_probre | RESOURCE_ROLE_DEPTNAME_prob_probre | RESOURCE_ROLE_TITLE_prob_probre | RESOURCE_ROLE_FAMILY_DESC_prob_probre | RESOURCE_MGR_ID_prob_proboth | RESOURCE_ROLE_ROLLUP_1_prob_proboth | RESOURCE_ROLE_ROLLUP_2_prob_proboth | RESOURCE_ROLE_DEPTNAME_prob_proboth | RESOURCE_ROLE_TITLE_prob_proboth | RESOURCE_ROLE_FAMILY_DESC_prob_proboth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 0.000092 | 0.001678 | … | 1.000000 | 0.333333 | 0.666667 | 1.000000 | 0.054545 | 0.000140 | 0.000678 | 0.013889 | 0.000558 | 0.000435 |
1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 0.000915 | 0.000305 | … | 0.366667 | 0.033333 | 0.033333 | 0.033333 | 0.100000 | 0.001215 | 0.002788 | 0.006289 | 0.012346 | 0.083333 |
2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 0.000061 | 0.000092 | … | 1.000000 | 1.000000 | 1.000000 | 0.500000 | 0.333333 | 0.010870 | 0.010870 | 0.003663 | 0.001592 | 0.030303 |
3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 0.000031 | 0.001892 | … | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.016129 | 0.000047 | 0.000253 | 0.005263 | 0.000215 | 0.000804 |
4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 0.000244 | 0.000275 | … | 0.250000 | 0.250000 | 0.500000 | 0.375000 | 0.222222 | 0.010870 | 0.014493 | 0.044444 | 0.053333 | 0.157895 |
5 rows × 103 columns
- <code class="language-python hljs "><span class="hljs-comment"># Probability of each pair of other variables given RESOURCE. Because the three-variable frequencies were divided by 91690 rather than 32769, these ratios top out at 32769/91690 (about 0.357) instead of 1.</span>
- resourcethree = [ <span class="hljs-string">'RESOURCE_MGR_ID_ROLE_ROLLUP_1_prob'</span>,<span class="hljs-string">'RESOURCE_MGR_ID_ROLE_ROLLUP_2_prob'</span>, <span class="hljs-string">'RESOURCE_MGR_ID_ROLE_DEPTNAME_prob'</span>,
-                   <span class="hljs-string">'RESOURCE_MGR_ID_ROLE_TITLE_prob'</span>,<span class="hljs-string">'RESOURCE_MGR_ID_ROLE_FAMILY_DESC_prob'</span>, <span class="hljs-string">'RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob'</span>,
-                   <span class="hljs-string">'RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_ROLLUP_1_ROLE_TITLE_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob'</span>,
-                   <span class="hljs-string">'RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_prob'</span>, <span class="hljs-string">'RESOURCE_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob'</span>,
-                   <span class="hljs-string">'RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_prob'</span>, <span class="hljs-string">'RESOURCE_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_TITLE_ROLE_FAMILY_DESC_prob'</span>]
- <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>,len(resourcethree)):
-     a = data[resourcethree[i]]/data.RESOURCE_prob
-     data[resourcethree[i]+<span class="hljs-string">"_"</span>+<span class="hljs-string">"probre"</span>]=a
- data.head()</code>
| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob | MGR_ID_prob | … | RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob_probre | RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_prob_probre | RESOURCE_ROLE_ROLLUP_1_ROLE_TITLE_prob_probre | RESOURCE_ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob_probre | RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob_probre | RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_prob_probre | RESOURCE_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob_probre | RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_prob_probre | RESOURCE_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob_probre | RESOURCE_ROLE_TITLE_ROLE_FAMILY_DESC_prob_probre |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 0.000092 | 0.001678 | … | 0.357389 | 0.119130 | 0.238259 | 0.357389 | 0.119130 | 0.238259 | 0.357389 | 0.119130 | 0.119130 | 0.238259 |
1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 0.000915 | 0.000305 | … | 0.131043 | 0.011913 | 0.011913 | 0.011913 | 0.011913 | 0.011913 | 0.011913 | 0.011913 | 0.011913 | 0.011913 |
2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 0.000061 | 0.000092 | … | 0.357389 | 0.357389 | 0.357389 | 0.178695 | 0.357389 | 0.357389 | 0.178695 | 0.357389 | 0.178695 | 0.178695 |
3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 0.000031 | 0.001892 | … | 0.357389 | 0.357389 | 0.357389 | 0.357389 | 0.357389 | 0.357389 | 0.357389 | 0.357389 | 0.357389 | 0.357389 |
4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 0.000244 | 0.000275 | … | 0.089347 | 0.089347 | 0.134021 | 0.134021 | 0.044674 | 0.089347 | 0.089347 | 0.089347 | 0.089347 | 0.134021 |
5 rows × 118 columns
- <code class="language-python hljs "><span class="hljs-comment"># Probability of RESOURCE given the values of the other two variables</span>
- resourcethree = [ <span class="hljs-string">'RESOURCE_MGR_ID_ROLE_ROLLUP_1_prob'</span>,<span class="hljs-string">'RESOURCE_MGR_ID_ROLE_ROLLUP_2_prob'</span>, <span class="hljs-string">'RESOURCE_MGR_ID_ROLE_DEPTNAME_prob'</span>,
- <span class="hljs-string">'RESOURCE_MGR_ID_ROLE_TITLE_prob'</span>,<span class="hljs-string">'RESOURCE_MGR_ID_ROLE_FAMILY_DESC_prob'</span>, <span class="hljs-string">'RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob'</span>,
- <span class="hljs-string">'RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_ROLLUP_1_ROLE_TITLE_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob'</span>,
- <span class="hljs-string">'RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_prob'</span>, <span class="hljs-string">'RESOURCE_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob'</span>,
- <span class="hljs-string">'RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_prob'</span>, <span class="hljs-string">'RESOURCE_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob'</span>,<span class="hljs-string">'RESOURCE_ROLE_TITLE_ROLE_FAMILY_DESC_prob'</span>]
- othertwo = [<span class="hljs-string">'MGR_ID_ROLE_ROLLUP_1_prob'</span>,<span class="hljs-string">'MGR_ID_ROLE_ROLLUP_2_prob'</span>,<span class="hljs-string">'MGR_ID_ROLE_DEPTNAME_prob'</span>, <span class="hljs-string">'MGR_ID_ROLE_TITLE_prob'</span>,
- <span class="hljs-string">'MGR_ID_ROLE_FAMILY_DESC_prob'</span>, <span class="hljs-string">'ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob'</span>, <span class="hljs-string">'ROLE_ROLLUP_1_ROLE_DEPTNAME_prob'</span>, <span class="hljs-string">'ROLE_ROLLUP_1_ROLE_TITLE_prob'</span>,
- <span class="hljs-string">'ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob'</span>, <span class="hljs-string">'ROLE_ROLLUP_2_ROLE_DEPTNAME_prob'</span>, <span class="hljs-string">'ROLE_ROLLUP_2_ROLE_TITLE_prob'</span>,
- <span class="hljs-string">'ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob'</span>, <span class="hljs-string">'ROLE_DEPTNAME_ROLE_TITLE_prob'</span>,<span class="hljs-string">'ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob'</span>,
- <span class="hljs-string">'ROLE_TITLE_ROLE_FAMILY_DESC_prob'</span>]
- <span class="hljs-keyword">for</span> three, two <span class="hljs-keyword">in</span> zip(resourcethree, othertwo):
-     <span class="hljs-comment"># ratio of each three-way probability column to the matching two-way column</span>
-     data[two + <span class="hljs-string">"_proboth"</span>] = data[three] / data[two]
- data.head()</code>
| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob | MGR_ID_prob | … | ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob_proboth | ROLE_ROLLUP_1_ROLE_DEPTNAME_prob_proboth | ROLE_ROLLUP_1_ROLE_TITLE_prob_proboth | ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob_proboth | ROLE_ROLLUP_2_ROLE_DEPTNAME_prob_proboth | ROLE_ROLLUP_2_ROLE_TITLE_prob_proboth | ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob_proboth | ROLE_DEPTNAME_ROLE_TITLE_prob_proboth | ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob_proboth | ROLE_TITLE_ROLE_FAMILY_DESC_prob_proboth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 0.000092 | 0.001678 | … | 0.000242 | 0.005415 | 0.000245 | 0.000181 | 0.005415 | 0.001574 | 0.000985 | 0.016245 | 0.006498 | 0.000274 |
1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 0.000915 | 0.000305 | … | 0.000997 | 0.002859 | 0.004412 | 0.029782 | 0.002906 | 0.018810 | 0.071478 | 0.071478 | 0.071478 | 0.029782 |
2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 0.000061 | 0.000092 | … | 0.003885 | 0.054983 | 0.014015 | 0.178695 | 0.054983 | 0.014015 | 0.178695 | 0.003885 | 0.178695 | 0.178695 |
3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 0.000031 | 0.001892 | … | 0.000091 | 0.002019 | 0.000087 | 0.000295 | 0.002166 | 0.000486 | 0.001402 | 0.003574 | 0.006162 | 0.000673 |
4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 0.000244 | 0.000275 | … | 0.005180 | 0.032490 | 0.067010 | 0.107217 | 0.019855 | 0.089347 | 0.089347 | 0.119130 | 0.119130 | 0.063069 |
5 rows × 133 columns
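The two ratio steps above can be read as conditional probabilities. The following is a minimal pure-Python sketch on made-up rows, assuming (as the column names suggest) that each `_prob` column holds the empirical frequency of that value combination; the toy values and the helper names `probre` and `proboth` are hypothetical and only mirror the column suffixes used above.

```python
from collections import Counter

# Toy rows of (RESOURCE, MGR_ID, ROLE_TITLE); values are invented for illustration.
rows = [
    ("r1", "m1", "t1"),
    ("r1", "m1", "t1"),
    ("r1", "m2", "t1"),
    ("r2", "m1", "t1"),
    ("r2", "m1", "t1"),
]
n = len(rows)

# Empirical probabilities: share of rows carrying each value combination.
p_triple = {k: c / n for k, c in Counter(rows).items()}
p_resource = {k: c / n for k, c in Counter(r for r, _, _ in rows).items()}
p_pair = {k: c / n for k, c in Counter((m, t) for _, m, t in rows).items()}


def probre(row):
    """P(MGR_ID, ROLE_TITLE | RESOURCE) = P(triple) / P(RESOURCE)."""
    return p_triple[row] / p_resource[row[0]]


def proboth(row):
    """P(RESOURCE | MGR_ID, ROLE_TITLE) = P(triple) / P(pair)."""
    return p_triple[row] / p_pair[row[1:]]


example = ("r1", "m1", "t1")
print(probre(example), proboth(example))
```

Dividing the three-way frequency by the frequency of `RESOURCE` alone conditions on the resource, while dividing by the two-way frequency conditions on the other two variables; the same joint count therefore yields two different features.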
- <code class="language-python hljs "><span class="hljs-comment"># Split into training and test sets</span>
- <span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
- <span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> RandomForestClassifier
- <span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> confusion_matrix, roc_curve,roc_auc_score,classification_report
- y = data.ACTION
- X = data.drop(columns=[<span class="hljs-string">"ACTION"</span>])
- X_train, X_test, y_train, y_test = train_test_split(
- X, y, test_size=<span class="hljs-number">0.3</span>, random_state=<span class="hljs-number">0</span>)</code>
- <code class="language-python hljs "><span class="hljs-comment"># Fit a random forest on the features built above (the 133 columns minus ACTION)</span>
- forest = RandomForestClassifier(criterion=<span class="hljs-string">'entropy'</span>,
- n_estimators=<span class="hljs-number">1000</span>,
- random_state=<span class="hljs-number">1</span>,
- n_jobs=<span class="hljs-number">2</span>)
- RFfit = forest.fit(X_train , y_train)</code>
- <code class="language-python hljs "><span class="hljs-comment"># Predict on the test set</span>
- preds = RFfit.predict(X_test)</code>
- <code class="language-python hljs "><span class="hljs-comment"># Confusion matrix of the model</span>
- confusion_matrix(y_test,preds)</code>
- <code>array([[ 138, 420],
- [ 59, 9214]])
- </code>
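To make the confusion matrix easier to read, the counts can be turned into a few rates. This sketch only does arithmetic on the four numbers reported above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, ordered `[0, 1]`; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above.
# Rows are true labels, columns are predicted labels, ordered [0, 1].
tn, fp = 138, 420
fn, tp = 59, 9214
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
# Accuracy of a trivial model that always predicts ACTION = 1.
majority_baseline = (fn + tp) / total
# Share of true ACTION = 0 rows that the model predicts as 0.
recall_class0 = tn / (tn + fp)
# Share of true ACTION = 1 rows that the model predicts as 1.
recall_class1 = tp / (tp + fn)

print(f"test rows         {total}")
print(f"accuracy          {accuracy:.3f}")
print(f"majority baseline {majority_baseline:.3f}")
print(f"recall, class 0   {recall_class0:.3f}")
print(f"recall, class 1   {recall_class1:.3f}")
```

On these counts accuracy is about 0.951 against about 0.943 for always predicting 1, and only about a quarter of the class-0 rows are recovered at the default threshold, so accuracy alone says little on data this imbalanced.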
- <code class="language-python hljs "><span class="hljs-comment"># ROC AUC score of the model</span>
- pre = RFfit.predict_proba(X_test)
- roc_auc_score(y_test,pre[:,<span class="hljs-number">1</span>])</code>
- <code>0.8639483844684166
- </code>
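The ROC AUC can be interpreted as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes it that way on a small made-up example; `pairwise_auc` is a hypothetical helper written for illustration, not the routine scikit-learn uses internally, and it is quadratic in the number of rows.

```python
def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented labels and scores, not taken from the Amazon data.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 3 of 4 pairs are ordered correctly: 0.75
```

Because the score depends only on how rows are ranked, it is unaffected by the 0.5 cut-off that `predict` applies.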
- <code class="language-python hljs "><span class="hljs-comment"># Plot the model's ROC curve</span>
- fpr,tpr,thresholds = roc_curve(y_test,pre[:,<span class="hljs-number">1</span>])
- fig,ax = plt.subplots(figsize=(<span class="hljs-number">8</span>,<span class="hljs-number">5</span>))
- ax.plot(fpr,tpr)
- ax.set_title(<span class="hljs-string">"ROC curve of the random forest"</span>)</code>
On the Kaggle test set the model scores 0.89, which indicates that it is reasonably effective.