1. Introduction to DataX
DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It implements efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS. This walkthrough uses DataX to transfer data between Hive and MySQL.
The resources used here are three DC2 instances, each with 2 CPU cores, 4 GB of memory, and a 40 GB SSD; EIP bandwidth is 1 Mbps.
DataX: 10.254.125.48
Hive: 10.254.237.61
MySQL: 10.254.116.249
These are the instances' internal IPs, used for communication among the three hosts over the internal network; downloading files or accepting external access requires an EIP, i.e. a public (elastic) IP. For details on DiDi Cloud EIPs, see: https://help.didiyun.com/hc/kb/section/1035272/
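Before installing anything, it can be worth confirming that the three hosts can reach one another over the internal network. A minimal check from the DataX node, assuming the security-group rules permit ICMP within the internal network:

```
# Run on the DataX node (10.254.125.48); assumes ICMP is allowed between the hosts.
ping -c 3 10.254.237.61    # Hive node
ping -c 3 10.254.116.249   # MySQL node
```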
2. Install Hadoop 2.7.7 + Hive 2.3.4 on the Hive node
- For security reasons, DiDi Cloud instances do not allow direct root login by default; log in as dc2-user first, then switch to root with sudo su if needed. In this walkthrough all commands are run as dc2-user, and the Hadoop user is likewise dc2-user.
- Set up passwordless SSH login by generating a key pair for dc2-user.
```
ssh-keygen -b 4096
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:zRhhVpEfSIZydqV75775sZB0GBjZ/f7nnZ4mgfYrWa8 hadoop@10-254-149-24
The key's randomart image is:
+---[RSA 4096]----+
| ++=*+ .         |
|.o+o+o+. .       |
| +...o o        .|
|  = .. o        .|
|   S + oo.o      |
|      +.=o      .|
|     . +o+     ..|
|      o +.     +O|
|            .EXO=|
+----[SHA256]-----+
```
Run the following command to copy the generated public key to the local machine:
```
ssh-copy-id -i $HOME/.ssh/id_rsa.pub dc2-user@localhost
```
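To confirm that key-based login works, you can run a command over SSH; it should complete without a password prompt:

```
# Prints the hostname without prompting for a password if the key was installed correctly.
ssh dc2-user@localhost hostname
```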
- Configure the Java environment
- Download the JDK
```
cd /home/dc2-user
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u191-b12/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64.tar.gz
tar -zxf jdk-8u191-linux-x64.tar.gz
```
- Configure the Java environment variables
```
sudo vi /etc/profile.d/jdk-1.8.sh

export JAVA_HOME=/home/dc2-user/jdk1.8.0_191
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
```
- Apply the environment variables
```
source /etc/profile
```
- Check the Java version
```
java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
```
If you see the output above, the Java environment has been configured successfully.
- Install Hadoop
Download Hadoop 2.7.7 on this node and extract it:
```
cd /home/dc2-user
wget http://mirrors.shu.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar zxf hadoop-2.7.7.tar.gz
```
Five files under /home/dc2-user/hadoop-2.7.7/etc/hadoop need to be configured: hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml.
- Add the following to hadoop-env.sh
```
export JAVA_HOME=/home/dc2-user/jdk1.8.0_191
export HDFS_NAMENODE_USER="dc2-user"
export HDFS_DATANODE_USER="dc2-user"
export HDFS_SECONDARYNAMENODE_USER="dc2-user"
export YARN_RESOURCEMANAGER_USER="dc2-user"
export YARN_NODEMANAGER_USER="dc2-user"
```
- core-site.xml
```
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.254.237.61:9000</value>
  </property>
</configuration>
```
- hdfs-site.xml
```
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/dc2-user/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/dc2-user/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>10.254.237.61:50070</value>
  </property>
</configuration>
```
- yarn-site.xml
```
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```
- mapred-site.xml
```
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
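Note that Hadoop 2.7.7 ships only mapred-site.xml.template, so if mapred-site.xml does not exist yet, create it from the template before editing:

```
cd /home/dc2-user/hadoop-2.7.7/etc/hadoop
cp mapred-site.xml.template mapred-site.xml
```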
- Configure the Hadoop environment variables
```
sudo vi /etc/profile.d/hadoop-2.7.7.sh

export HADOOP_HOME="/home/dc2-user/hadoop-2.7.7"
export PATH="$HADOOP_HOME/bin:$PATH"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
```
Apply the environment variables:
```
source /etc/profile
```
Run hadoop version and check for output to verify that the environment variables have taken effect:
```
hadoop version
Hadoop 2.7.7
Subversion Unknown -r c1aad84bd27cd79c3d1a7dd58202a8c3ee1ed3ac
Compiled by stevel on 2018-07-18T22:47Z
Compiled with protoc 2.5.0
From source with checksum 792e15d20b12c74bd6f19a1fb886490
This command was run using /home/dc2-user/hadoop-2.7.7/share/hadoop/common/hadoop-common-2.7.7.jar
```
- Format HDFS
```
/home/dc2-user/hadoop-2.7.7/bin/hdfs namenode -format testCluster
```
- Start the services
```
/home/dc2-user/hadoop-2.7.7/sbin/start-dfs.sh
/home/dc2-user/hadoop-2.7.7/sbin/start-yarn.sh
```
- Check that the services are running
```
jps
13680 Jps
13268 ResourceManager
13367 NodeManager
12956 DataNode
13117 SecondaryNameNode
12830 NameNode
```
If you see the output above, the services have started normally, and you can reach the HDFS web UI via the Hive node's public IP. Note that port 50070 must be opened in the security group; for details on DiDi Cloud security groups, see: https://help.didiyun.com/hc/kb/article/1091031/
Note: exposing port 50070 to the public internet can be exploited by attackers to plant malware, so it is recommended to restrict the allowed source IPs in the security group, or not to open the port at all.
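If you do open the port temporarily, a quick reachability check from your local machine might look like this (a sketch; <EIP> is a hypothetical placeholder, since the article does not show the actual elastic IP):

```
# <EIP> is a placeholder for the Hive node's elastic IP.
# A successful response returns the NameNode web UI's HTML.
curl -s http://<EIP>:50070/ | head -n 5
```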
3. Installing and Configuring Hive 2.3.4
Hive is a data warehouse built on Hadoop. It maps structured data files to tables and provides SQL-like query capabilities, translating SQL statements into MapReduce jobs under the hood.
- Download Hive 2.3.4 to /home/dc2-user on the Hive node and extract it
```
wget http://mirror.bit.edu.cn/apache/hive/hive-2.3.4/apache-hive-2.3.4-bin.tar.gz
tar zxvf apache-hive-2.3.4-bin.tar.gz
```
- Set the Hive environment variables
Edit the file /etc/profile.d/hive.sh and add the following:
```
sudo vi /etc/profile.d/hive.sh

export HIVE_HOME=/home/dc2-user/apache-hive-2.3.4-bin
export PATH=$PATH:$HIVE_HOME/bin
```
Apply the environment variables:
```
source /etc/profile
```
- Configure Hive
Copy the following configuration files from their templates:
```
cd apache-hive-2.3.4-bin/conf/
cp hive-env.sh.template hive-env.sh
cp hive-default.xml.template hive-site.xml
cp hive-log4j2.properties.template hive-log4j2.properties
cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties
```
Edit hive-env.sh:
```
export JAVA_HOME=/home/dc2-user/jdk1.8.0_191
export HADOOP_HOME=/home/dc2-user/hadoop-2.7.7
export HIVE_HOME=/home/dc2-user/apache-hive-2.3.4-bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
```
Edit hive-site.xml, updating the value of each of the following properties:
```
vi hive-site.xml

<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive-${user.name}</value>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/${user.name}</value>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/tmp/hive/resources</value>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/tmp/${user.name}</value>
  </property>
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/tmp/${user.name}/operation_logs</value>
  </property>
</configuration>
```
- Configure the Hive Metastore
The Hive Metastore stores the metadata for Hive tables and partitions; this example uses MariaDB as the backing store.
Download mysql-connector-java-5.1.40-bin.jar into $HIVE_HOME/lib and add the MySQL connection settings to hive-site.xml.
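If you need a copy of the driver, one option (an assumption on my part, not from the original text) is Maven Central, where the artifact is named mysql-connector-java-5.1.40.jar without the -bin suffix but is the same driver:

```
cd /home/dc2-user/apache-hive-2.3.4-bin/lib
# Fetch the MySQL JDBC driver from Maven Central (standard repository layout assumed).
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.40/mysql-connector-java-5.1.40.jar
```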
```
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
</configuration>
```
Install MySQL; this example uses MariaDB.
```
sudo yum install -y mariadb-server
sudo systemctl start mariadb
```
Log in to MySQL (there is no password initially), create the hive user, and set its password.
```
mysql -uroot

MariaDB [(none)]> create user 'hive'@'localhost' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> grant all privileges on *.* to hive@localhost identified by 'hive';
Query OK, 0 rows affected (0.00 sec)
```
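A quick way to confirm the new account works, using the hive/hive credentials created above:

```
# Should list the databases visible to the hive user without an access-denied error.
mysql -uhive -phive -e "show databases;"
```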
- Run Hive
HDFS must be running before you start Hive; it can be started with start-dfs.sh. If it is already running from the Hadoop installation above, skip this step.
Starting with Hive 2.1, you must run the schematool command to initialize the metastore schema before launching Hive.
```
schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/dc2-user/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/dc2-user/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:        jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       hive
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed
```
Start Hive by entering the command hive:
```
hive
which: no hbase in (/home/dc2-user/java/jdk1.8.0_191/bin:/home/dc2-user/hadoop-2.7.7/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/bin:/home/dc2-user/apache-hive-2.3.4-bin/bin:/home/dc2-user/.local/bin:/home/dc2-user/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/dc2-user/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/dc2-user/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in file:/home/dc2-user/apache-hive-2.3.4-bin/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>
```
- Test Hive
Create a database in Hive and create a table in it:
```
hive> create database hive_datax;
OK
Time taken: 4.137 seconds
hive> use hive_datax;
OK
Time taken: 0.017 seconds
hive> create table hive_datax(id int, name string);
OK
Time taken: 0.5 seconds
```
The table has been created successfully; next, insert some data into it.
```
hive> insert into hive_datax values(1,'tom');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = dc2-user_20181204110438_eb16f016-a5e2-46bf-ad45-123d7f87e727
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2018-12-04 11:04:43,296 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local1630301164_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://10.254.237.61:9000/user/hive/warehouse/hive_datax.db/hive_datax/.hive-staging_hive_2018-12-04_11-04-38_784_3010010078484421565-1/-ext-10000
Loading data to table hive_datax.hive_datax
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 6 HDFS Write: 89 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 5.107 seconds
```
Query the data in the table:
```
hive> select * from hive_datax;
OK
1	tom
Time taken: 0.191 seconds, Fetched: 1 row(s)
```
The row was inserted successfully and can be queried normally.
4. Install MariaDB on the MySQL node
Installation follows the same MariaDB steps as on the Hive node. Once it is running, set the root password and enable remote login:
```
mysql -uroot
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 5.5.56-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> set password for 'root'@'localhost'=password('******');
MariaDB [(none)]> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '********' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> flush privileges;
Query OK, 0 rows affected (0.00 sec)
```
Next, create the database and the table that the data will be imported into:
```
MariaDB [(none)]> create database mysql_datax;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> use mysql_datax;
Database changed

MariaDB [mysql_datax]> create table mysql_datax(id int(10), name varchar(20), primary key(id));
Query OK, 0 rows affected (0.00 sec)
```
5. Data Synchronization with DataX
Setup on the DataX node is straightforward: first configure the Java environment variables as described earlier, then download DataX and extract it.
```
cd /home/dc2-user
wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
tar zxvf datax.tar.gz
cd datax
```
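Before writing a custom job, you can sanity-check the installation with the stream-to-stream sample job shipped in the DataX tarball (the job/job.json path is assumed from the default archive layout):

```
# The sample job generates records in memory and prints them to stdout;
# a successful run confirms the Python launcher and the JVM are set up correctly.
python /home/dc2-user/datax/bin/datax.py /home/dc2-user/datax/job/job.json
```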
- Hive-to-MySQL synchronization: edit the job file:
```
cd job
vi hive2mysql

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "path": "/user/hive/warehouse/hive_datax.db/hive_datax/*",
                        "defaultFS": "hdfs://10.254.237.61:9000",
                        "column": [
                            { "index": 0, "type": "long" },
                            { "index": 1, "type": "string" }
                        ],
                        "fileType": "text",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\u0001"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "root",
                        "password": "********",
                        "column": ["id", "name"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://10.254.116.249:3306/mysql_datax",
                                "table": ["mysql_datax"]
                            }
                        ],
                        "writeMode": "insert",
                        "batchSize": 1024
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
```
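Before launching the job, it may help to confirm on the Hive node that the path the hdfsreader will scan actually contains data (a sanity check added here, not part of the original steps):

```
# Run on the Hive node, where the Hadoop client is installed.
# Lists the warehouse files backing the hive_datax table.
hadoop fs -ls /user/hive/warehouse/hive_datax.db/hive_datax/
```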
Run DataX to start the synchronization:
```
python /home/dc2-user/datax/bin/datax.py /home/dc2-user/datax/job/hive2mysql

2018-12-04 16:37:28.809 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-12-04 16:37:28.810 [job-0] INFO  StandAloneJobContainerCommunicator - Total 1 records, 4 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2018-12-04 16:37:28.812 [job-0] INFO  JobContainer -
任务启动时刻 : 2018-12-04 16:37:17
任务结束时刻 : 2018-12-04 16:37:28
任务总计耗时 : 11s
任务平均流量 : 0B/s
记录写入速度 : 0rec/s
读出记录总数 : 1
读写失败总数 : 0
```
If the output ends with the lines above, the synchronization is complete. Check whether the data has arrived in MySQL.
```
MariaDB [(none)]> use mysql_datax;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [mysql_datax]> select * from mysql_datax;
+----+------+
| id | name |
+----+------+
|  1 | tom  |
+----+------+
1 row in set (0.00 sec)
```
The data was synchronized successfully.
- MySQL-to-Hive synchronization
First, insert a new record into the mysql_datax table on the MySQL node:
```
MariaDB [mysql_datax]> insert into mysql_datax values(2,'jerry');
Query OK, 1 row affected (0.00 sec)

MariaDB [mysql_datax]> select * from mysql_datax;
+----+-------+
| id | name  |
+----+-------+
|  1 | tom   |
|  2 | jerry |
+----+-------+
2 rows in set (0.00 sec)
```
On the DataX node, edit mysql2hive:
```
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "********",
                        "column": ["id", "name"],
                        "splitPk": "id",
                        "connection": [
                            {
                                "table": ["mysql_datax"],
                                "jdbcUrl": ["jdbc:mysql://10.254.116.249:3306/mysql_datax"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://10.254.237.61:9000",
                        "fileType": "text",
                        "path": "/user/hive/warehouse/hive_datax.db/hive_datax/",
                        "fileName": "test",
                        "column": [
                            { "name": "id", "type": "int" },
                            { "name": "name", "type": "string" }
                        ],
                        "writeMode": "append",
                        "fieldDelimiter": "\u0001",
                        "compress": "gzip"
                    }
                }
            }
        ]
    }
}
```
Run DataX to synchronize the data:
```
python /home/dc2-user/datax/bin/datax.py /home/dc2-user/datax/job/mysql2hive

2018-12-04 20:50:16.375 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-12-04 20:50:16.376 [job-0] INFO  StandAloneJobContainerCommunicator - Total 2 records, 10 bytes | Speed 1B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2018-12-04 20:50:16.378 [job-0] INFO  JobContainer -
任务启动时刻 : 2018-12-04 20:50:04
任务结束时刻 : 2018-12-04 20:50:16
任务总计耗时 : 11s
任务平均流量 : 1B/s
记录写入速度 : 0rec/s
读出记录总数 : 2
读写失败总数 : 0
```
If the output ends like the above, the synchronization succeeded. Note that 2 records were read: the job reads the entire MySQL table, so the first record is synchronized a second time.
Check the contents of the Hive table:
```
hive> select * from hive_datax;
OK
1	tom
1	tom
2	jerry
```
Both the old and the new records have been written.
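If the duplicated row is unwanted, one possible cleanup (a sketch, not part of the original article; it rewrites the entire table, so use with care on large tables) is to deduplicate in Hive afterwards:

```
# Rewrites hive_datax.hive_datax keeping one copy of each distinct (id, name) pair.
hive -e "insert overwrite table hive_datax.hive_datax select distinct id, name from hive_datax.hive_datax;"
```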
Reference: https://github.com/alibaba/DataX
Author: 贺子一