Introduction to Hive
The official Hive documentation describes Hive in detail, so we will not repeat it here. In one sentence:
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
Installing Hive
The installation process follows the official guide:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
Hive depends on Hadoop; we install it on top of the HA Hadoop cluster set up in the previous post.
First, download Hive, choosing a release that matches our Hadoop version; the stable release we use is apache-hive-1.2.2-bin.tar.gz. We will install it on the namenode of the Hadoop cluster, i.e., on the master machine. Upload the tarball to the /home/hadoop/ directory.
$ cd /home/hadoop/
$ tar xzvf apache-hive-1.2.2-bin.tar.gz
After extraction we get the directory apache-hive-1.2.2-bin. Let's look at what the package contains:
$ cd /home/hadoop/apache-hive-1.2.2-bin
$ ls
bin conf examples hcatalog lib LICENSE NOTICE README.txt RELEASE_NOTES.txt scripts
$
$ ls conf/
beeline-log4j.properties.template hive-env.sh.template hive-log4j.properties.template
hive-default.xml.template hive-exec-log4j.properties.template ivysettings.xml
Configure HADOOP_HOME:
$ cd /home/hadoop/apache-hive-1.2.2-bin/conf/
$ cp hive-default.xml.template hive-site.xml
$ cp hive-env.sh.template hive-env.sh
$ vi hive-env.sh
HADOOP_HOME=/home/hadoop/hadoop-2.7.3
At this point Hive is configured and ready to run. Still, it is worth reading the following section on configuring where Hive stores its metadata, because production environments usually need it.
Configuring the Hive metastore location (optional)
By default, Hive stores its metadata in the embedded Derby database (bundled with the Hive distribution), but we can also store it in another database, such as MySQL. Here is how:
First, create a database in MySQL to hold Hive's metadata; we name the database hive:
mysql> CREATE DATABASE IF NOT EXISTS hive COLLATE = 'utf8_general_ci' CHARACTER SET = 'utf8';
mysql> GRANT ALL ON hive.* TO 'hive'@'%' IDENTIFIED BY 'hive';
mysql> GRANT ALL ON hive.* TO 'hive'@'localhost' IDENTIFIED BY 'hive';
mysql> FLUSH PRIVILEGES;
Next, configure Hive to use MySQL for its metastore:
$ cd /home/hadoop/apache-hive-1.2.2-bin/conf/
$ vi hive-site.xml
Modify the following properties, assuming our database host, username and password are as shown:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://mysql.hewentian.com:3306/hive</value>
<description>
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password to use against metastore database</description>
</property>
Finally, put the MySQL JDBC driver jar, mysql-connector-java-5.1.42.jar, into the apache-hive-1.2.2-bin/lib directory.
That completes the optional configuration.
Starting Hive
Before starting Hive for the first time, we need to create a few directories in HDFS to store Hive's data. On the master node where Hive is installed, run:
$ cd /home/hadoop/hadoop-2.7.3/
$ ./bin/hdfs dfs -mkdir /tmp
$ ./bin/hdfs dfs -mkdir -p /user/hive/warehouse
$
$ ./bin/hdfs dfs -chmod g+w /tmp
$ ./bin/hdfs dfs -chmod g+w /user/hive/warehouse
Initialize the metastore. Hive defaults to the embedded Derby database for metadata storage; here we use MySQL. To use the default instead, simply change mysql below to derby.
$ cd /home/hadoop/apache-hive-1.2.2-bin/bin
$ ./schematool -dbType mysql -initSchema
Metastore connection URL: jdbc:mysql://mysql.hewentian.com:3306/hive
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: hive
Starting metastore schema initialization to 1.2.0
Initialization script hive-schema-1.2.0.mysql.sql
Initialization script completed
schemaTool completed
Now start Hive proper:
$ cd /home/hadoop/apache-hive-1.2.2-bin/bin
$ ./hive
At startup you may hit the following error:
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:659)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:582)
at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:549)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:750)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
at java.net.URI.checkPath(URI.java:1823)
at java.net.URI.<init>(URI.java:745)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
... 12 more
To fix it, first create a directory:
$ cd /home/hadoop/apache-hive-1.2.2-bin/
$ mkdir iotmp
Then, in hive-site.xml:
- replace every configuration value that contains ${system:java.io.tmpdir} with the directory created above, /home/hadoop/apache-hive-1.2.2-bin/iotmp (there are 4 occurrences);
- replace ${system:user.name} with hadoop.
The modified properties are:
<property>
<name>hive.exec.local.scratchdir</name>
<value>/home/hadoop/apache-hive-1.2.2-bin/iotmp/hadoop</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/home/hadoop/apache-hive-1.2.2-bin/iotmp/${hive.session.id}_resources</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/home/hadoop/apache-hive-1.2.2-bin/iotmp/hadoop</value>
<description>Location of Hive run time structured log file</description>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/home/hadoop/apache-hive-1.2.2-bin/iotmp/hadoop/operation_logs</value>
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
Restart Hive:
$ cd /home/hadoop/apache-hive-1.2.2-bin/bin
$ ./hive
Logging initialized using configuration in jar:file:/home/hadoop/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
hive> show databases;
OK
default
Time taken: 0.821 seconds, Fetched: 1 row(s)
hive>
> use default;
OK
Time taken: 0.043 seconds
hive>
> show tables;
OK
Time taken: 0.094 seconds
hive>
Hive is now installed successfully. As shown above, Hive ships with a default database named default, which contains no tables yet.
First steps with Hive
A longer tutorial that covers more features of HiveQL:
https://cwiki.apache.org/confluence/display/Hive/Tutorial
The HiveQL Language Manual:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Create a database:
hive> CREATE DATABASE IF NOT EXISTS tim;
As before, we can see it in HDFS:
$ cd /home/hadoop/hadoop-2.7.3/
$ ./bin/hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxrwxr-x - hadoop supergroup 0 2019-01-01 19:32 /user/hive/warehouse/tim.db
Create a table
hive> use tim;
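The original CREATE TABLE statement is not shown above. A sketch consistent with the columns used later (id, name, age, sex, birthday, address) and with the tab field delimiter that the LOAD example below relies on might look like this; the column types are assumptions, not the original definition:

```sql
-- Hypothetical reconstruction of the t_user table; types are guesses.
CREATE TABLE IF NOT EXISTS t_user (
    id       INT,
    name     STRING,
    age      INT,
    sex      STRING,
    birthday STRING,
    address  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'   -- must match the delimiter used in data files
STORED AS TEXTFILE;
```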
Describe the table
hive> desc t_user;
Insert data
hive> INSERT INTO TABLE t_user(id, name, age, sex, birthday, address) VALUES(1, 'Tim Ho', 23, 'M', '1989-05-01', 'Higher Education Mega Center South, Guangzhou city, Guangdong Province');
The insert operation launches a MapReduce job.
Query data
hive> select * from t_user;
As shown above, a simple query does not launch MapReduce, but an aggregation such as COUNT will trigger a MapReduce job.
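For instance, an aggregate query like the following is compiled into a MapReduce job, whereas the plain SELECT * above is served by a simple file scan:

```sql
hive> SELECT COUNT(*) FROM t_user;
```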
Loading data from a file
The syntax is:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
We can store rows in a text file, with fields separated by the delimiter defined in the table schema (\t), and then import the file with the LOAD command. For example, put the following data in /home/hadoop/user.txt:
2	scott	25	M	1977-10-21	USA
3	tiger	21	F	1977-08-12	UK
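One way to produce such a file with real tab characters (printf expands \t to a literal tab, which is what the table's field delimiter expects) is, for example:

```shell
# Write two tab-separated rows: id, name, age, sex, birthday, address.
printf '2\tscott\t25\tM\t1977-10-21\tUSA\n'  > user.txt
printf '3\ttiger\t21\tF\t1977-08-12\tUK\n'  >> user.txt
```

Copy the resulting user.txt to /home/hadoop/ before running the LOAD command. A text editor also works, as long as the separators are actual tabs, not spaces.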
Then run the LOAD command in Hive:
hive> LOAD DATA LOCAL INPATH '/home/hadoop/user.txt' INTO TABLE t_user;
Loading data to table tim.t_user
Table tim.t_user stats: [numFiles=2, numRows=0, totalSize=151, rawDataSize=0]
OK
Time taken: 0.214 seconds
hive>
> select * from t_user;
OK
1 Tim Ho 23 M 1989-05-01 Higher Education Mega Center South, Guangzhou city, Guangdong Province
2 scott 25 M 1977-10-21 USA
3 tiger 21 F 1977-08-12 UK
Time taken: 0.085 seconds, Fetched: 3 row(s)
hive>
Operating Hive from Java code
An HQL script can typically be executed in one of the following ways:
- hive -e "hql";
- hive -f "hql.file";
- hive jdbc code.
This section covers how to operate Hive from Java. First start HiveServer2; the hiveserver2 command is intended to eventually replace the hive command.
$ cd /home/hadoop/apache-hive-1.2.2-bin/bin
$ ./hiveserver2
After it starts, you may notice that it prints nothing at all. Now start beeline in another shell window:
$ cd /home/hadoop/apache-hive-1.2.2-bin/bin
$ ./beeline -u jdbc:hive2://hadoop-host-master:10000 -n hadoop -p hadoop
Connecting to jdbc:hive2://hadoop-host-master:10000
Connected to: Apache Hive (version 1.2.2)
Driver: Hive JDBC (version 1.2.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.2 by Apache Hive
0: jdbc:hive2://hadoop-host-master:10000>
0: jdbc:hive2://hadoop-host-master:10000> show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
| tim |
+----------------+--+
2 rows selected (0.217 seconds)
0: jdbc:hive2://hadoop-host-master:10000> use tim;
No rows affected (0.08 seconds)
0: jdbc:hive2://hadoop-host-master:10000> show tables;
+-----------+--+
| tab_name |
+-----------+--+
| t_user |
+-----------+--+
1 row selected (0.071 seconds)
0: jdbc:hive2://hadoop-host-master:10000> select * from t_user;
+------------+--------------+-------------+-------------+------------------+-------------------------------------------------------------------------+--+
| t_user.id | t_user.name | t_user.age | t_user.sex | t_user.birthday | t_user.address |
+------------+--------------+-------------+-------------+------------------+-------------------------------------------------------------------------+--+
| 1 | Tim Ho | 23 | M | 1989-05-01 | Higher Education Mega Center South, Guangzhou city, Guangdong Province |
| 2 | scott | 25 | M | 1977-10-21 | USA |
| 3 | tiger | 21 | F | 1977-08-12 | UK |
+------------+--------------+-------------+-------------+------------------+-------------------------------------------------------------------------+--+
3 rows selected (0.219 seconds)
0: jdbc:hive2://hadoop-host-master:10000>
As shown above, the operations are the same as under the hive command. The -p hadoop argument can also be omitted from the command above; it can be configured in hive-site.xml instead.
Example Java code that operates Hive can be found here: HiveUtil.java, HiveDemo.java
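As a rough illustration of what such code looks like (a sketch, not the linked files: it uses only the standard java.sql API, and assumes the hive-jdbc driver is on the classpath and a HiveServer2 is reachable at hadoop-host-master:10000, as in the beeline example above):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    // Assumed endpooint and credentials, matching the beeline example above.
    private static final String URL = "jdbc:hive2://hadoop-host-master:10000/tim";

    public static void main(String[] args) throws Exception {
        // Requires org.apache.hive.jdbc.HiveDriver (hive-jdbc jar) on the classpath.
        try (Connection conn = DriverManager.getConnection(URL, "hadoop", "hadoop");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM t_user")) {
            while (rs.next()) {
                System.out.println(rs.getInt("id") + "\t" + rs.getString("name"));
            }
        }
    }
}
```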
Running HiveServer2 in the background
For versions 1.2 and above, hive is deprecated and the hiveserver2 command should be used directly.
So the correct way to start hiveserver2 in background is now:
cd /home/hadoop/apache-hive-1.2.2-bin/bin
nohup ./hiveserver2 &
Or with output to a log file:
nohup ./hiveserver2 > hive.log &
To be continued…