Hive Database Operations
Create a database
Hive databases are stored in the /user/hive/warehouse directory by default.
create database myhive;
create database if not exists myhive;
Create a database at a specified storage path; location specifies the directory in which the database is stored.
create database myhive location '/myhive';
Show databases
show databases;
-- filter with like
show databases like 'my*';
View database details
desc database myhive;
More detailed information
desc database extended myhive;
Switch the current database
use myhive;
Drop a database
Drop an empty database; if the database still contains tables, an error is raised.
drop database myhive;
drop database if exists myhive;
Force-drop a database, deleting the tables inside it as well.
drop database myhive cascade;
drop database if exists myhive cascade;
Modify database properties
Use the ALTER DATABASE command to set key-value pairs in a database's DBPROPERTIES, describing the database.
Only the properties can be changed; the metadata (the database name and the database location) cannot be modified.
alter database myhive2 set dbproperties('createtime'='20220202');
Hive Data Types
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
Basic data types
TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, DATE, BINARY, DECIMAL, VARCHAR, CHAR
Complex data types
ARRAY, MAP, STRUCT, UNIONTYPE
Hive Table Operations
Database encoding issue
An exception is thrown when creating a table:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:An exception was thrown while adding/validating class(es) : Column length too big for column 'PARAM_VALUE' (max = 21845); use BLOB or TEXT instead
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Column length too big for column 'PARAM_VALUE' (max = 21845); use BLOB or TEXT instead
Fix: the metastore database usually defaults to utf-8 or utf8mb4 encoding; change the Hive metastore database's encoding to latin1.
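A minimal sketch of the fix, run against the MySQL metastore (the metastore database name hive is an assumption; substitute your own):
-- run in MySQL, not in Hive; 'hive' is the assumed metastore database name
ALTER DATABASE hive DEFAULT CHARACTER SET latin1;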
Syntax format
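The general shape of the statement, following the Hive DDL documentation (optional clauses in brackets):
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  (col_name data_type [COMMENT col_comment], ...)
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]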
CREATE TABLE
Creates a table with the given name; an error is raised if a table with the same name already exists (IF NOT EXISTS suppresses the error).
EXTERNAL
Creates an external table; together with LOCATION it points the table at existing data, and dropping the table removes only the metadata, not the data.
COMMENT
Adds a description to the table or to a column.
PARTITIONED BY
Declares partition columns; a partition column must not have the same name as a table column.
CLUSTERED BY
Distributes the rows into buckets by the given columns.
SORTED BY
Sorts the data inside each bucket by the given columns.
ROW FORMAT
Specifies the SerDe and the field, collection-item, map-key and line delimiters used to parse rows.
STORED AS
Specifies the file format (for example TEXTFILE, SEQUENCEFILE, ORC, PARQUET).
LOCATION
Specifies the HDFS path where the table's data is stored.
LIKE
Copies the structure of an existing table without copying its data.
Create a table
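A minimal sketch (the table name and columns are assumed for illustration):
create table if not exists tb_test (id int, name string);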
Complex table
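A sketch using complex types (table name, columns, and delimiters are assumed):
create table tb_complex (
  name string,
  friends array<string>,
  children map<string, int>,
  address struct<street:string, city:string>
)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';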
List tables
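The command for listing the tables in the current database:
show tables;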
Query the table type
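A sketch (tb_test is an assumed table name); the Table Type field of the output shows MANAGED_TABLE or EXTERNAL_TABLE:
desc formatted tb_test;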
Query the table structure
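A sketch (table name assumed):
desc tb_test;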
Insert data
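A sketch (table and values assumed):
insert into table tb_test values (1, 'zhangsan'), (2, 'lisi');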
Query data
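A sketch (table name assumed):
select * from tb_test;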
Modify a table
Rename a table
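A sketch (old and new names assumed):
alter table tb_test rename to tb_test_new;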
Add columns
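A sketch (column names and types assumed):
alter table tb_test add columns (age int comment 'age of the student', score double);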
Update a column
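A sketch renaming a column (column names assumed):
alter table tb_test change column score total_score double;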
Drop a table
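A sketch:
drop table if exists tb_test;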
Internal and external tables
When a table is dropped, an internal table's metadata and data are deleted together, while for an external table only the metadata is deleted and the data is kept.
An external table maps data at a specified HDFS path into the table; Hive does not take exclusive ownership of that data, so when the table is dropped the data remains in HDFS.
Create an internal table
Create an internal table and specify the field delimiter
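A sketch (table name, columns, and the tab delimiter are assumed):
create table if not exists tb_test1 (id int, name string)
row format delimited fields terminated by '\t';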
Create a table from a query result (copies both the table structure and its contents into a new table)
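A sketch completing the statement (the source table tb_test1 is assumed):
create table tb_test2 as select * from tb_test1;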
Create a table from an existing table's structure (structure only, without data)
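A sketch (the new table name tb_test3 is assumed):
create table tb_test3 like tb_test1;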
Create an external table
Create an external table and specify the field delimiter
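A sketch (table name, columns, delimiter, and location are assumed):
create external table if not exists tb_ext (id int, name string)
row format delimited fields terminated by '\t'
location '/hive/tb_ext';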
Data Loading and Export
Data loading
load data [local] inpath 'datapath' [overwrite] into table tb_name [partition (partcol1=val1,…)];
local: load from the local file system (omit it to load from an HDFS path); overwrite: replace the data already in the table; partition: load into the specified partition.
1. Load local Linux data
Load data (this copies the source file into the table's HDFS directory)
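A sketch (the local file path is assumed):
load data local inpath '/var/hive/tb_test.txt' into table tb_test;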
Load data and overwrite the existing data
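A sketch (path assumed):
load data local inpath '/var/hive/tb_test.txt' overwrite into table tb_test;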
2. Load HDFS data
Load data into the table from the HDFS file system (this moves the file from its HDFS location into the table's HDFS directory)
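A sketch (the HDFS path is assumed):
load data inpath '/hive/input/tb_test.txt' into table tb_test;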
3. Load a table by inserting the result of a query
Insert a query result into a single table
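A sketch completing the statement (the source table tb_test is assumed):
insert overwrite table tb_test1 select id, name from tb_test;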
Write a query result into several tables in one pass
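A sketch of a multi-insert (source and target tables are assumed):
from tb_test
insert overwrite table tb_test1 select id, name
insert overwrite table tb_test2 select id, name where id > 10;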
Data export
1. Export to the local file system
Create a local directory to hold the exported data (for example, mkdir -p /var/hive/tb_test).
Write the query result to the local file system
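A sketch completing the statement (the source table tb_test is assumed):
insert overwrite local directory '/var/hive/tb_test' select * from tb_test;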
Export to the local file system with a specified output format
insert overwrite local directory '/var/hive/tb_test'
ROW FORMAT DELIMITED
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n'
select * from tb_test;
2. Export to HDFS
Write the query result to HDFS
# create the target directory
hdfs dfs -mkdir -p /lzj/copy
# export the query result
insert overwrite directory '/hive/back/tb_test'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select * from tb_test;
Copy the files to another directory with HDFS commands, for example hdfs dfs -cp /hive/back/tb_test/* /lzj/copy/
Back up the table structure and the data together
# export the table
export table tb_test to '/hive/back/tb_test';
# drop the table
drop table tb_test;
# restore the table structure and data
import from '/hive/back/tb_test';
Partition Tables
Split large data sets into small files according to some rule (year, month, day, hour), and then operate on the small files.
Create a partitioned table
Create a table with one partition column; the partition column must not have the same name as any table column.
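A sketch (table name, columns, and the month partition column are assumed):
create table tb_part (id int, name string)
partitioned by (month string)
row format delimited fields terminated by '\t';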
Create a table with multiple partition columns; each later partition column is nested under the previous one (a parent-child relationship).
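A sketch with two partition levels (names assumed):
create table tb_part2 (id int, name string)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';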
Load data into a partitioned table
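A sketch (file path and partition values are assumed):
load data local inpath '/var/hive/tb_part.txt' into table tb_part partition (month = '202201');
load data local inpath '/var/hive/tb_part.txt' into table tb_part2 partition (month = '202201', day = '01');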
Querying a partitioned table
Joint query across several partitions
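A sketch querying two partitions together (partition values assumed):
select * from tb_part where month = '202201'
union all
select * from tb_part where month = '202202';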
Insert data directly
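A sketch (partition value and row values assumed):
insert into table tb_part partition (month = '202203') values (1, 'zhangsan');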
Insert data via a query
load approach
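A sketch (path and partition value assumed):
load data local inpath '/var/hive/tb_part.txt' into table tb_part partition (month = '202204');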
select approach
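A sketch (source table and partition value assumed):
insert overwrite table tb_part partition (month = '202205')
select id, name from tb_test;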
View the partitions of a table
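A sketch (table name assumed):
show partitions tb_part;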
Add a partition
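A sketch (partition value assumed):
alter table tb_part add partition (month = '202206');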
Drop a partition
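A sketch (partition value assumed):
alter table tb_part drop partition (month = '202206');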
Dynamic Partitions
The operations above use static partitioning.
The main difference between static and dynamic partitioning is that static partitions are specified manually, while dynamic partitions are determined from the data.
Static partition columns are decided at compile time from the values the user supplies; dynamic partitions can only be decided when the SQL statement is executed.
To use dynamic partitioning, it must first be enabled in the Hive session.
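The standard session settings for dynamic partitioning:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;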
Other related parameters:
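Commonly tuned parameters (the values shown are Hive's defaults):
-- maximum number of dynamic partitions per node
set hive.exec.max.dynamic.partitions.pernode=100;
-- maximum total number of dynamic partitions
set hive.exec.max.dynamic.partitions=1000;
-- maximum number of HDFS files a job may create
set hive.exec.max.created.files=100000;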
Create the partitioned target table
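A sketch of the partitioned target table (names assumed):
create table tb_dp (id int, name string)
partitioned by (month string)
row format delimited fields terminated by '\t';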
Create an external source table
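A sketch of the external source table holding the raw data (names and location assumed):
create external table tb_dp_src (id int, name string, month string)
row format delimited fields terminated by '\t'
location '/hive/dp_src';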
With static partitioning, the partition value must be specified when inserting data; if there are several different partition values, the statement has to be executed once per value.
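A sketch of the static case, one statement per partition value (names assumed):
insert overwrite table tb_dp partition (month = '202201')
select id, name from tb_dp_src where month = '202201';
insert overwrite table tb_dp partition (month = '202202')
select id, name from tb_dp_src where month = '202202';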
With dynamic partitioning, Hive determines automatically from the select result which partition each row should be loaded into.
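A sketch of the dynamic case; the last column of the select supplies the partition value:
insert overwrite table tb_dp partition (month)
select id, name, month from tb_dp_src;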
Bucketed Tables
Divide the data into several buckets (several files) according to a specified column.
Hive hashes the column value and takes the result modulo the number of buckets to decide which bucket a record is stored in.
Enable bucketing
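The session setting (on Hive 2.x and later bucketing is enforced by default and this property no longer exists):
set hive.enforce.bucketing=true;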
Set the number of reducers
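A sketch; the value would normally match the number of buckets (3 is assumed):
set mapreduce.job.reduces=3;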
Create a bucketed table
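A sketch (table name, columns, and bucket count assumed):
create table tb_buck (id int, name string)
clustered by (id) into 3 buckets
row format delimited fields terminated by '\t';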
Create an ordinary table
Create an ordinary table, then load its data into the bucketed table by querying it with insert overwrite.
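A sketch of the staging table (name assumed):
create table tb_buck_common (id int, name string)
row format delimited fields terminated by '\t';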
Load data into the ordinary table
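A sketch (file path assumed):
load data local inpath '/var/hive/tb_buck.txt' into table tb_buck_common;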
Load data into the bucketed table
Load the bucketed table with insert overwrite.
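A sketch:
insert overwrite table tb_buck
select id, name from tb_buck_common cluster by id;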
Data Sampling
Block sampling: Hive randomly samples N rows, a percentage of the total data, or N bytes of data.
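A sketch of the three block-sampling forms (table name and amounts assumed):
select * from tb_buck tablesample (10 rows);
select * from tb_buck tablesample (10 percent);
select * from tb_buck tablesample (1M);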
Bucket sampling
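A sketch; tablesample (bucket x out of y on col) takes bucket x out of y:
select * from tb_buck tablesample (bucket 1 out of 3 on id);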
Random sampling
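A common pattern using rand() (row count assumed):
select * from tb_buck distribute by rand() sort by rand() limit 10;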
Hive Queries
Syntax format
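The general form of a SELECT statement, following the Hive documentation:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]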
order by
Global ordering of the whole result set, using a single reducer.
sort by
Orders the data within each reducer; the overall result is not globally sorted.
distribute by
Controls which reducer each row is sent to, by hashing the given columns.
Cluster by
Equivalent to distribute by plus sort by on the same columns; only ascending order is supported.
Query syntax
select id, name from tb_test;
Column aliases
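A sketch; AS is optional (aliases assumed):
select id as stu_id, name stu_name from tb_test;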
LIMIT
The LIMIT clause restricts the number of rows returned.
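A sketch:
select * from tb_test limit 5;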
WHERE
The WHERE clause filters rows by a condition.
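A sketch (condition assumed):
select * from tb_test where id > 10;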
LIKE and RLIKE
LIKE is used the same way as in MySQL.
RLIKE is a Hive extension of this feature that lets the match condition be specified with a Java regular expression.
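A sketch of both (patterns assumed):
-- names starting with 'zh'
select * from tb_test where name like 'zh%';
-- names containing the letter 'a', using a Java regular expression
select * from tb_test where name rlike '[a]';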
Common functions (see the combined sketch after this list)
Row count (count)
Maximum (max)
Minimum (min)
Sum (sum)
Average (avg)
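A sketch of the five aggregates over the assumed tb_test table:
select count(1) from tb_test;
select max(id) from tb_test;
select min(id) from tb_test;
select sum(id) from tb_test;
select avg(id) from tb_test;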
Comparison operators
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
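A sketch of a few comparison operators (values assumed):
select * from tb_test where id = 5;
select * from tb_test where id between 5 and 10;
select * from tb_test where name is null;
select * from tb_test where id in (5, 10);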
Logical operators
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
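A sketch of AND, OR and NOT (values assumed):
select * from tb_test where id > 5 and name = 'zhangsan';
select * from tb_test where id > 5 or id < 2;
select * from tb_test where id not in (5, 10);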
Grouping
GROUP BY
The GROUP BY clause is usually used together with aggregate functions: rows are grouped by one or more columns and each group is then aggregated.
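A sketch (grouping column assumed):
select name, count(1) as cnt from tb_test group by name;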
HAVING
HAVING filters groups after aggregation and may use aggregate functions, whereas WHERE filters individual rows before grouping and cannot.
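A sketch:
select name, count(1) as cnt from tb_test group by name having count(1) > 1;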
JOIN statements
Hive supports the usual SQL JOIN syntax, but only equi-joins; non-equi joins are not supported.
Equi-JOIN
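The join examples below assume these test tables (tb_a, tb_b and tb_c come from the multi-table example further down; the columns are assumed):
create table tb_a (id int, name string);
create table tb_b (id int, score int);
create table tb_c (id int, course string);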
Inner join: only the rows that have a match for the join condition in both tables are returned.
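A sketch:
select a.id, a.name, b.score
from tb_a a inner join tb_b b on a.id = b.id;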
Left outer join: all rows from the left table that satisfy the WHERE clause are returned.
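A sketch:
select a.id, a.name, b.score
from tb_a a left outer join tb_b b on a.id = b.id;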
Right outer join: all rows from the right table that satisfy the WHERE clause are returned.
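A sketch:
select a.id, a.name, b.score
from tb_a a right outer join tb_b b on a.id = b.id;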
Multi-table join: joining n tables requires at least n-1 join conditions.
In most cases Hive starts one MapReduce job for each pair of JOIN objects.
For example, it first starts a MapReduce job to join tb_a and tb_b, and then starts another MapReduce job to join the output of the first job with tb_c.
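A sketch joining the three tables on id:
select a.id, a.name, b.score, c.course
from tb_a a
join tb_b b on a.id = b.id
join tb_c c on a.id = c.id;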
Sorting
Global ordering
Order By performs a global sort with a single reducer, so large inputs take a long time to process.
order by is placed at the end of the select statement; use the ORDER BY clause to sort.
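A sketch of ascending and descending order (column assumed):
select * from tb_test order by id;
select * from tb_test order by id desc;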
Sort by an alias
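A sketch (the expression and alias are assumed):
select id, id * 2 as double_id from tb_test order by double_id;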
Sort by multiple columns
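A sketch (columns assumed):
select * from tb_test order by id asc, name desc;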
Sort By (local sorting)
Sort By sorts locally within each MapReduce task: the sorting is completed before the data enters the reducer, so each reducer produces sorted output but the overall result is not globally ordered.
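A sketch (reducer count and column assumed):
set mapreduce.job.reduces=3;
select * from tb_test sort by id desc;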
Partitioned sorting (DISTRIBUTE BY)
distribute by (column): distributes rows to different reducers according to the given column, using a hash of its value.
It must be used together with sort by, and distribute by has to come before sort by.
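A sketch (columns assumed):
set mapreduce.job.reduces=3;
select * from tb_test distribute by id sort by name;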
Partition and sort (CLUSTER BY)
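When the distribute by and sort by columns are the same, cluster by can be used instead; the two statements below are equivalent, and only ascending order is supported:
select * from tb_test cluster by id;
select * from tb_test distribute by id sort by id;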
Hive Functions
Built-in functions
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inFunctions
List the system's built-in functions (see the sketch below)
Show the usage of a built-in function
Show the detailed usage of a built-in function
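A sketch; upper is used as the example function:
show functions;
desc function upper;
desc function extended upper;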
Custom functions
https://cwiki.apache.org/confluence/display/Hive/HivePlugins
When Hive's built-in functions cannot meet a business need, you can extend Hive conveniently with user-defined functions (UDFs).
Custom functions fall into three categories:
UDF (user-defined function): one row in, one row out.
UDAF (user-defined aggregation function): many rows in, one row out, similar to count or max.
UDTF (user-defined table-generating function): one row in, many rows out, such as lateral view explode.
Add the dependency
The project needs the hive-exec dependency, matching the Hive version in use.
Write the function
The UDF is a Java class that extends Hive's UDF base class (org.apache.hadoop.hive.ql.exec.UDF or GenericUDF) and implements an evaluate method containing the logic.
Package the project into a jar and upload it to the machine where the Hive client runs.
Add the jar in the Hive client
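A sketch (the jar path is an assumed example):
add jar /var/hive/lib/hive-udf.jar;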
Register a function name for the custom class
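A sketch (the function name and class are assumed examples):
create temporary function my_lower as 'com.example.hive.MyLowerUDF';
-- use it like a built-in function
select my_lower(name) from tb_test;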
Use the custom function, as shown at the end of the sketch above.
Original post: https://juejin.cn/post/7098144957932044325