
Python Data Analysis: Cleaning Data in 7 Steps

Date: 2020-08-07 10:12:21

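Data cleaning is an important task in machine learning and deep learning that comes before any algorithm is applied. The 7 steps I habitually use are summarized below: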

  • Step 1: read the CSV
  • Step 2: preview the data
  • Step 3: check every column for null values
  • Step 4: fill in null values
  • Step 5: feature engineering
  • Step 5.1: drop some features
  • Step 5.2: create new features
  • Step 5.3: bin continuous features
  • Step 6: encode categorical columns
  • Step 6.1: scikit-learn LabelEncoder
  • Step 6.2: pandas get_dummies
  • Step 7: check the cleaned data

Today we use the Titanic dataset to walk through all 7 steps in full.

1 Read the data

It goes without saying: the first step is to read in the data.

import pandas as pd

data_raw = pd.read_csv('../input/titanicdataset-traincsv/train.csv')
data_raw

Result:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q
891 rows × 12 columns

2 Preview the data

data_raw.info()
data_raw.describe(include='all')

Result:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
count	891.000000	891.000000	891.000000	891	891	714.000000	891.000000	891.000000	891	891.000000	204	889
unique	NaN	NaN	NaN	891	2	NaN	NaN	NaN	681	NaN	147	3
top	NaN	NaN	NaN	Hakkarainen, Mr. Pekka Pietari	male	NaN	NaN	NaN	1601	NaN	G6	S
freq	NaN	NaN	NaN	1	577	NaN	NaN	NaN	7	NaN	4	644
mean	446.000000	0.383838	2.308642	NaN	NaN	29.699118	0.523008	0.381594	NaN	32.204208	NaN	NaN
std	257.353842	0.486592	0.836071	NaN	NaN	14.526497	1.102743	0.806057	NaN	49.693429	NaN	NaN
min	1.000000	0.000000	1.000000	NaN	NaN	0.420000	0.000000	0.000000	NaN	0.000000	NaN	NaN
25%	223.500000	0.000000	2.000000	NaN	NaN	20.125000	0.000000	0.000000	NaN	7.910400	NaN	NaN
50%	446.000000	0.000000	3.000000	NaN	NaN	28.000000	0.000000	0.000000	NaN	14.454200	NaN	NaN
75%	668.500000	1.000000	3.000000	NaN	NaN	38.000000	1.000000	0.000000	NaN	31.000000	NaN	NaN
max	891.000000	1.000000	3.000000	NaN	NaN	80.000000	8.000000	6.000000	NaN	512.329200	NaN	NaN

3 Check for null values

data1 = data_raw.copy(deep=True)

data1.isnull().sum()

Result:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The Age column has 177 null values and Embarked has 2. Cabin has 687 nulls out of only 891 rows, so it is probably of little value.
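Raw null counts are easier to judge as a share of the rows: from the numbers above, Cabin is missing in 687/891 ≈ 77% of rows and Age in 177/891 ≈ 20%. A minimal sketch of that calculation follows; the toy frame and the names `toy` and `null_ratio` are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# The mean of a boolean mask is the fraction of missing values per column
null_ratio = toy.isnull().mean().sort_values(ascending=False)
print(null_ratio)
```

Columns at the top of this ranking are candidates for dropping; those near the bottom are usually worth filling.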

4 Fill in null values

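# Assign back rather than using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to data1
data1['Age'] = data1['Age'].fillna(data1['Age'].median())
data1['Embarked'] = data1['Embarked'].fillna(data1['Embarked'].mode()[0])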

data1.isnull().sum()

Check after filling:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
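To see what the two fill rules do, here is a small sketch on made-up data: the numeric column gets its median (less sensitive to extreme values than the mean) and the categorical column gets its most frequent value. The frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", None, "C", "S"],
})

# Numeric column: fill with the median of the observed values
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill with the most frequent value;
# mode() returns a Series, so take its first entry
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```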

5 Feature engineering

5.1 Drop 3 columns:

drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)

5.2 Add 3 columns

Add a FamilySize column (siblings/spouses plus parents/children plus the passenger):

data1['FamilySize'] = data1['SibSp'] + data1['Parch'] + 1
data1

Output:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Fare	Embarked	FamilySize
0	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	7.2500	S	2
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	71.2833	C	2
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	7.9250	S	1
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	53.1000	S	2
4	0	3	Allen, Mr. William Henry	male	35.0	0	0	8.0500	S	1
...	...	...	...	...	...	...	...	...	...	...
886	0	2	Montvila, Rev. Juozas	male	27.0	0	0	13.0000	S	1
887	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	30.0000	S	1
888	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	28.0	1	2	23.4500	S	4
889	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	30.0000	C	1
890	0	3	Dooley, Mr. Patrick	male	32.0	0	0	7.7500	Q	1
891 rows × 10 columns

Create another column, IsAlone (1 if the passenger travels alone, otherwise 0):

import numpy as np
data1['IsAlone'] = np.where(data1['FamilySize'] > 1,0,1)
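The two derived columns are easy to verify by hand on a few rows. The sketch below uses a made-up frame `toy`; the `IsAlone_alt` column is only an illustrative NumPy-free equivalent, not something the original uses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: (SibSp, Parch) pairs
toy = pd.DataFrame({"SibSp": [1, 0, 1], "Parch": [0, 0, 2]})

# Family size counts the passenger too, hence the + 1
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# np.where(condition, value_if_true, value_if_false)
toy["IsAlone"] = np.where(toy["FamilySize"] > 1, 0, 1)

# Same flag written as a boolean comparison cast to int
toy["IsAlone_alt"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```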

Create one more column, Title, taken from the text between the comma and the first period in Name:

data1['Title'] = data1['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
data1

Result:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Fare	Embarked	FamilySize	IsAlone	Title
0	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	7.2500	S	2	0	Mr
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	71.2833	C	2	0	Mrs
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	7.9250	S	1	1	Miss
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	53.1000	S	2	0	Mrs
4	0	3	Allen, Mr. William Henry	male	35.0	0	0	8.0500	S	1	1	Mr
...	...	...	...	...	...	...	...	...	...	...	...	...
886	0	2	Montvila, Rev. Juozas	male	27.0	0	0	13.0000	S	1	1	Rev
887	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	30.0000	S	1	1	Miss
888	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	28.0	1	2	23.4500	S	4	0	Miss
889	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	30.0000	C	1	1	Mr
890	0	3	Dooley, Mr. Patrick	male	32.0	0	0	7.7500	Q	1	1	Mr
891 rows × 12 columns
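The title extraction relies on names having the shape "Surname, Title. Given names". As a rough sketch, the same result can be obtained with a single regular expression via `str.extract`; the regex and the names `title_split` and `title_regex` are my own illustration, not the original author's code.

```python
import pandas as pd

# A few names in the dataset's "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Two-step split: keep the part after ", ", then the part before "."
title_split = names.str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

# One-step alternative: capture everything between the comma and the first period
title_regex = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(title_split.tolist())
print(title_regex.tolist())
```

Multi-word titles such as "the Countess" survive both variants, which is why a column named "Title_the Countess" can show up after one-hot encoding.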

5.3 Binning

data1['FareCut'] = pd.qcut(data1['Fare'], 4)
data1['AgeCut'] = pd.cut(data1['Age'].astype(int), 6)
data1

Result:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Fare	Embarked	FamilySize	IsAlone	Title	FareCut	AgeCut
0	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	7.2500	S	2	0	Mr	(-0.001, 7.91]	(13.333, 26.667]
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	71.2833	C	2	0	Mrs	(31.0, 512.329]	(26.667, 40.0]
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	7.9250	S	1	1	Miss	(7.91, 14.454]	(13.333, 26.667]
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	53.1000	S	2	0	Mrs	(31.0, 512.329]	(26.667, 40.0]
4	0	3	Allen, Mr. William Henry	male	35.0	0	0	8.0500	S	1	1	Mr	(7.91, 14.454]	(26.667, 40.0]
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	0	2	Montvila, Rev. Juozas	male	27.0	0	0	13.0000	S	1	1	Rev	(7.91, 14.454]	(26.667, 40.0]
887	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	30.0000	S	1	1	Miss	(14.454, 31.0]	(13.333, 26.667]
888	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	28.0	1	2	23.4500	S	4	0	Miss	(14.454, 31.0]	(26.667, 40.0]
889	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	30.0000	C	1	1	Mr	(14.454, 31.0]	(13.333, 26.667]
890	0	3	Dooley, Mr. Patrick	male	32.0	0	0	7.7500	Q	1	1	Mr	(-0.001, 7.91]	(26.667, 40.0]
891 rows × 14 columns
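Fare is binned with `qcut` and Age with `cut`, and the two behave differently: `qcut` picks edges so that each bin holds roughly the same number of rows, while `cut` splits the value range into equal-width intervals. A small sketch on made-up, skewed values (the series `fares` is hypothetical) makes the difference visible.

```python
import pandas as pd

# Hypothetical skewed values: seven small numbers and one large outlier
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Quantile-based bins: edges chosen so each bin gets about the same count
equal_freq = pd.qcut(fares, 4)

# Equal-width bins: the range from min to max is cut into 4 equal intervals
equal_width = pd.cut(fares, 4)

print(equal_freq.value_counts().sort_index())
print(equal_width.value_counts().sort_index())
```

With a long-tailed column like Fare, equal-width bins would put almost everything in the first bin, which is a plausible reason for preferring `qcut` there.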

6 Encoding

6.1 The LabelEncoder approach

Use scikit-learn's LabelEncoder:

from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
data1['Sex_Code'] = label.fit_transform(data1['Sex'])
data1['Embarked_Code'] = label.fit_transform(data1['Embarked'])
data1['Title_Code'] = label.fit_transform(data1['Title'])
data1['AgeBin_Code'] = label.fit_transform(data1['AgeCut'])
data1['FareBin_Code'] = label.fit_transform(data1['FareCut'])
data1

After this, the encoded columns of data1 are integers, so a model can finally consume them.
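Refitting one LabelEncoder for each column works, but only the last fitted mapping is kept on the object. If the mappings are needed later (for example to decode predictions), one option is to keep a separate encoder per column. This is a sketch on made-up data; the names `toy`, `encoders`, and `mapping` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Keep one fitted encoder per column so each mapping stays available
encoders = {}
for col in ["Sex", "Embarked"]:
    enc = LabelEncoder()
    toy[col + "_Code"] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted; a class's position in it is its integer code
mapping = {
    col: {str(c): i for i, c in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mapping)

# Codes can be turned back into labels with inverse_transform
decoded = list(encoders["Sex"].inverse_transform([1, 0]))
```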

6.2 The get_dummies approach

get_dummies turns a long DataFrame into a wide one, with one indicator column per category:

pd.get_dummies(data1['Sex'])

Result:

	female	male
0	0	1
1	1	0
2	1	0
3	1	0
4	0	1
...	...	...
886	0	1
887	1	0
888	1	0
889	0	1
890	0	1
891 rows × 2 columns

LabelEncoder, by contrast, simply encodes female as 0 and male as 1 in a single column:

data1['Sex_Code']
0      1
1      0
2      0
3      0
4      1
      ..
886    1
887    0
888    0
889    1
890    1
Name: Sex_Code, Length: 891, dtype: int64
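When get_dummies is applied to a whole DataFrame, numeric columns pass through unchanged and each encoded column is replaced by prefixed indicator columns such as Sex_female and Sex_male. A small sketch on made-up data follows; `dtype=int` is passed explicitly because recent pandas versions return True/False indicators by default rather than 0/1.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 3],
})

# Encode only the listed columns; others are left as they are
dummies = pd.get_dummies(toy, columns=["Sex"], dtype=int)
print(dummies)
```

Unlike the integer codes from LabelEncoder, the indicator columns do not imply any order between categories, at the cost of a wider table.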

7 Final check

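# Step 7: data cleaning check
# Assumed definitions (not shown in the original), inferred from the column lists in the output below
data1_x_alg = ['Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'SibSp', 'Parch', 'Age', 'Fare']
data1_x = ['Sex', 'Pclass', 'Embarked', 'Title', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']
data1_dummy = pd.get_dummies(data1[data1_x])
data1[data1_x_alg].info()
print('-'*50)
data1_dummy.info()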

Result:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Sex_Code         891 non-null int64
Pclass           891 non-null int64
Embarked_Code    891 non-null int64
Title_Code       891 non-null int64
SibSp            891 non-null int64
Parch            891 non-null int64
Age              891 non-null float64
Fare             891 non-null float64
dtypes: float64(2), int64(6)
memory usage: 55.8 KB
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 29 columns):
Pclass                891 non-null int64
SibSp                 891 non-null int64
Parch                 891 non-null int64
Age                   891 non-null float64
Fare                  891 non-null float64
FamilySize            891 non-null int64
IsAlone               891 non-null int64
Sex_female            891 non-null uint8
Sex_male              891 non-null uint8
Embarked_C            891 non-null uint8
Embarked_Q            891 non-null uint8
Embarked_S            891 non-null uint8
Title_Capt            891 non-null uint8
Title_Col             891 non-null uint8
Title_Don             891 non-null uint8
Title_Dr              891 non-null uint8
Title_Jonkheer        891 non-null uint8
Title_Lady            891 non-null uint8
Title_Major           891 non-null uint8
Title_Master          891 non-null uint8
Title_Miss            891 non-null uint8
Title_Mlle            891 non-null uint8
Title_Mme             891 non-null uint8
Title_Mr              891 non-null uint8
Title_Mrs             891 non-null uint8
Title_Ms              891 non-null uint8
Title_Rev             891 non-null uint8
Title_Sir             891 non-null uint8
Title_the Countess    891 non-null uint8
dtypes: float64(2), int64(5), uint8(22)
memory usage: 68.0 KB
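Reading the info() output by eye works for a handful of columns. As a rough sketch, the same two conditions (no nulls, only numeric columns) can also be checked in code; the helper `check_clean` is hypothetical and not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_clean(df):
    """Return a list of problems; an empty list means the frame looks model-ready."""
    problems = []
    for col in df.columns:
        # Any remaining missing value is a problem
        if df[col].isnull().any():
            problems.append(f"{col}: has nulls")
        # Text or other non-numeric columns still need encoding
        if not is_numeric_dtype(df[col]):
            problems.append(f"{col}: non-numeric dtype {df[col].dtype}")
    return problems


# Hypothetical frames for illustration
dirty = pd.DataFrame({"Age": [1.0, None], "Sex": ["m", "f"]})
clean = pd.DataFrame({"Age": [1.0, 2.0], "Sex_Code": [1, 0]})
print(check_clean(dirty))
print(check_clean(clean))
```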

