Installing a Big Data Stack (Hadoop, Hive...)

In the command-line snippets below, the prompt shows where a command runs:
Host      --- (dp) jansora@jansora-PC:~/docker/bigdata$
Container --- root@4b2705f587d8:/#

Preparation

1. Install Docker

See https://www.jansora.com/post/install-docker

2. Set up the base image: Ubuntu 20.04

Pull the Ubuntu image: [[docker pull ubuntu:20.04]]

3. Create a data volume

Run [[docker volume create bigdata]] to create the data volume

(dp) jansora@jansora-PC:~/docker/bigdata$ docker volume create bigdata
bigdata
(dp) jansora@jansora-PC:~/docker/bigdata$ docker volume inspect bigdata
[
    {
        "CreatedAt": "2020-06-05T09:54:28+08:00",
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/bigdata/_data",
        "Name": "bigdata",
        "Options": {},
        "Scope": "local"
    }
]

4. Create a network

[[docker network create -d bridge --subnet=192.168.0.0/24 --gateway=192.168.0.100 --ip-range=192.168.0.0/24 bigdata]]
Check it:

(dp) jansora@jansora-PC:~/docker/bigdata$ docker network ls
NETWORK ID          NAME                    DRIVER              SCOPE
6ba66f53ceba        bigdata                 bridge              local
45592e17a68e        bridge                  bridge              local

5. Start the container

Open another terminal and start the base container:
[[docker run -it --network bigdata --mount source=bigdata,target=/data ubuntu:20.04]]

Install dependencies

Before installing the dependencies, optionally switch to a faster package mirror.

1. Configure the Tsinghua (TUNA) mirror (optional)

  1. On the host, create sources.list under /var/lib/docker/volumes/bigdata/_data with the following content:
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-updates main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-backports main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-security main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-security main restricted universe multiverse

# Pre-release sources; enabling them is not recommended
# deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-proposed main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-proposed main restricted universe multiverse
  2. Install the certificates the https mirror needs: apt update && apt install ca-certificates -y
  3. In the container, run [[cp /data/sources.list /etc/apt/]]
  4. Refresh the package lists: [[apt update]]

Install dependency packages

In the container, run [[apt install vim net-tools lrzsz]]

Set up JDK 8

1. In the container, run [[apt install openjdk-8-jdk -y]]

root@4b2705f587d8:/# java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1ubuntu1-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

2. Set JAVA_HOME. In the container, open vim $HOME/.bashrc and append export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/ at the bottom of the file.
3. Reload it: source $HOME/.bashrc && echo $JAVA_HOME
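For reference, a minimal sketch of what ends up in $HOME/.bashrc (the path is the default install location of Ubuntu's openjdk-8-jdk package; the PATH line is optional, since apt already links java via update-alternatives):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export PATH=$JAVA_HOME/bin:$PATH   # optional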

Set up SSH

1 Install it: apt install ssh (when the tzdata prompt appears during installation, choose Asia, then Shanghai, i.e. Asia/Shanghai)
2 Start it: service ssh start
3 Set up passwordless SSH to localhost
3.1 ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
3.2 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
3.3 chmod 0600 ~/.ssh/authorized_keys
3.4 [[ssh localhost]]
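Optionally verify that key-based login works without a password prompt:

ssh -o StrictHostKeyChecking=no localhost 'echo ssh ok'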

Save the changes so far to the bigdata image

4b2705f587d8 is the container ID; yours may differ.
On the host, run docker commit 4b2705f587d8 jansora/bigdata:base

Switch containers

  1. After committing, exit the current container (press [[ctrl + d]] inside it)
  2. Start a new container: [[docker run -it --network bigdata --mount source=bigdata,target=/data jansora/bigdata:base]]

Configure Hadoop

Download and install Hadoop

  1. On the host, download hadoop-2.10.0: [[wget https://mirrors.bfsu.edu.cn/apache/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz]]
  2. Extract it: tar xvf hadoop-2.10.0.tar.gz
  3. Move it into the data volume: [[sudo mv hadoop-2.10.0 /var/lib/docker/volumes/bigdata/_data]]
  4. Set HADOOP_HOME in the container. Open vim $HOME/.bashrc and append export HADOOP_HOME=/data/hadoop-2.10.0 at the bottom of the file.
  5. Adjust the startup scripts so the daemons can be started as root.
    Append to sbin/start-dfs.sh:
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root 

Append to sbin/start-yarn.sh:

YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

Configure etc/hadoop/hadoop-env.sh:

Append:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

Configure etc/hadoop/core-site.xml:

Create the temp directory first: [[mkdir -p ${HADOOP_HOME}/data/temp]]. Note that Hadoop does not expand shell environment variables inside its XML configuration, so if hadoop.tmp.dir ends up pointing at a literal ${HADOOP_HOME} path, replace the value below with the absolute path /data/hadoop-2.10.0/data/temp.

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://master:19230</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>${HADOOP_HOME}/data/temp</value>
        </property>
        <property>
                <name>io.file.buffer.size</name>
                <value>4096</value>
        </property>
        <property>
                <name>fs.trash.interval</name>
                <value>10080</value>
        </property>
</configuration>

配置 etc/hadoop/mapred-site.xml:

<configuration>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>

        <property>
          <name>mapreduce.job.ubertask.enable</name>
          <value>true</value>
        </property>

        <property>
          <name>mapreduce.jobhistory.address</name>
          <value>master:19238</value>
        </property>
        <property>
          <name>mapreduce.jobhistory.webapp.address</name>
          <value>master:19239</value>
       </property>

       <property>
         <name>yarn.app.mapreduce.am.env</name>
         <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
       </property>

       <property>
         <name>mapreduce.map.env</name>
         <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
       </property>

       <property>
         <name>mapreduce.reduce.env</name>
         <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
       </property>
       <property>
         <name>mapreduce.application.classpath</name>
         <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
       </property>

</configuration>

配置 etc/hadoop/yarn-site.xml:

<configuration>
        <!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>master</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>master:18040</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>master:18030</value>
        </property>
        <property>
                <name>yarn.resourcemanager.resource-tracker.address</name>
                <value>master:18025</value>
        </property>
        <property>
                <name>yarn.resourcemanager.admin.address</name>
                <value>master:18141</value>
        </property>
        <property>
                <name>yarn.resourcemanager.webapp.address</name>
                <value>master:18088</value>
        </property>
</configuration>

配置 etc/hadoop/hdfs-site.xml

<configuration>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>master:19231</value>
        </property>
        <property>
                <name>dfs.namenode.http-address</name>
                <value>master:19232</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:///hdfs/data/namenodeDatas</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:///hdfs/data/datanodeDatas</value>
        </property>
        <property>
                <name>dfs.namenode.edits.dir</name>
                <value>file:///hdfs/data/dfs/nn/edits</value>
        </property>
        <property>
                <name>dfs.namenode.checkpoint.dir</name>
                <value>file:///hdfs/data/dfs/snn/name</value>
        </property>
        <property>
                <name>dfs.namenode.checkpoint.edits.dir</name>
                <value>file:///hdfs/data/dfs/nn/snn/edits</value>
        </property>
        <property>
                <name>dfs.permissions</name>
                <value>false</value>
        </property>

        <!-- Block size (134217728 bytes = 128 MB) -->
        <property>
                <name>dfs.blocksize</name>
                <value>134217728</value>
        </property>
        <!-- Number of NameNode RPC handler threads -->
        <property>
                <name>dfs.namenode.handler.count</name>
                <value>100</value>
        </property>
        <!-- Number of block replicas -->
        <property>
                <name>dfs.replication</name>
                <value>3</value>
        </property>
</configuration>
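The dfs.*.dir values above point at local paths under /hdfs/data. Hadoop normally creates them itself when the NameNode is formatted and the DataNode starts, but pre-creating them (a sketch, with the paths copied from the configuration above) makes permission problems easier to spot:

mkdir -p /hdfs/data/namenodeDatas \
         /hdfs/data/datanodeDatas \
         /hdfs/data/dfs/nn/edits \
         /hdfs/data/dfs/snn/name \
         /hdfs/data/dfs/nn/snn/edits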


Configure etc/hadoop/slaves

This file lists the cluster hosts, one per line. Note that each host here is a container, so each container must be started with a matching --name (a docker run sketch follows the host list below):
the master is started with docker run -it --name master --......
slave 1 is started with docker run -it --name slave1 --......
slave 2 is started with docker run -it --name slave2 --......
... start as many slaves as you need, one per entry in the slaves file

master
slave1
slave2
slave3
......
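A hedged sketch of how the containers could be started so that these names resolve (Docker's embedded DNS resolves container names on a user-defined bridge network such as bigdata; the image tag and the --mount/--publish flags follow the run commands used earlier in this post and may need adjusting for your setup):

docker run -itd --name master --hostname master --network bigdata \
    --mount source=bigdata,target=/data jansora/bigdata:base-hadoop
docker run -itd --name slave1 --hostname slave1 --network bigdata \
    --mount source=bigdata,target=/data jansora/bigdata:base-hadoop
docker run -itd --name slave2 --hostname slave2 --network bigdata \
    --mount source=bigdata,target=/data jansora/bigdata:base-hadoop
# -d keeps the containers running in the background; drop it for an interactive shell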

Start

./start-all.sh
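A minimal sketch of a first start, assuming a fresh cluster (the NameNode must be formatted once before HDFS can start):

cd /data/hadoop-2.10.0
bin/hdfs namenode -format        # only before the very first start
sbin/start-all.sh                # runs start-dfs.sh and start-yarn.sh
jps                              # the HDFS/YARN daemons (NameNode, ResourceManager, ...) should appear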

Verify the installation

root@master:/data/hadoop-2.10.0/etc/hadoop# hadoop fs -mkdir /input
root@master:/data/hadoop-2.10.0/etc/hadoop# echo 'test' > test.txt
root@master:/data/hadoop-2.10.0/etc/hadoop# hadoop fs -put test.txt /input/
root@master:/data/hadoop-2.10.0/etc/hadoop# hadoop fs -ls /input
Found 1 items
-rw-r--r--   3 root supergroup          5 2020-06-10 15:20 /input/test.txt
root@master:/data/hadoop-2.10.0/etc/hadoop# hadoop fs -cat /input/test.txt
test
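As an optional further check that YARN and MapReduce work, run the bundled wordcount example against the file just uploaded (the jar path is the standard location inside the Hadoop 2.10.0 distribution; /output must not exist yet):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar wordcount /input /output
hadoop fs -cat /output/part-r-00000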

Save the changes so far to the bigdata image

4b2705f587d8 is the container ID; yours may differ.
On the host, run docker commit 4b2705f587d8 jansora/bigdata:base-hadoop

Switch containers

  1. After committing, exit the current container (press [[ctrl + d]] inside it)
  2. Start a new container: [[docker run -it --network bigdata --mount source=bigdata,target=/data --publish 19231:19231 --publish 19232:19232 --publish 19233:19233 --publish 19234:19234 --publish 19235:19235 --publish 19236:19236 --publish 19237:19237 --publish 19238:19238 --publish 19239:19239 --publish 19240:19240 --publish 19241:19241 jansora/bigdata:base-hadoop]]

Troubleshooting

Fix for the Secondary NameNode web UI showing no information:
https://blog.csdn.net/u012834750/article/details/80508464

Configure Hive

Reference: https://www.jianshu.com/p/40fc2414bc7f

Download Hive

[[wget https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-2.3.7/apache-hive-2.3.7-bin.tar.gz]]

Configure environment variables

The Hadoop settings stay the same as above.
Edit the environment variables: [[vim ~/.bashrc]]

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HADOOP_HOME=/data/hadoop-2.10.0
export HIVE_HOME=/data/apache-hive-2.3.7-bin

export PATH=$HADOOP_HOME/bin:$PATH
export PATH=$HADOOP_HOME/sbin:$PATH
export PATH=$HIVE_HOME/bin:$PATH

Make sure ssh is running in the container: service ssh start

Configure the metastore

Option 1: initialize with the default Derby database
    [[schematool -dbType derby -initSchema]]

Derby stores the metadata in a metastore_db directory under whichever directory the command is run from, so Hive must later be started from that same directory.
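For example, a minimal sketch that pins the metastore to $HIVE_HOME (any fixed directory works):

cd $HIVE_HOME
schematool -dbType derby -initSchema    # creates ./metastore_db
hive                                    # later, start hive from this same directory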

Option 2: initialize with MySQL

Configure the MySQL driver

This requires an external MySQL instance; installing MySQL is not covered here.
Assume MySQL listens on port 1688 with user root, password 123456, and remote access enabled.
2.1 Download the MySQL driver
Pick a driver from https://downloads.mysql.com/archives/c-j/ ; the latest version is fine.

  1. [[wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-8.0.19.tar.gz]]
  2. Extract it and copy the jar into $HIVE_HOME/lib (see the sketch after the listing below)
root@master:/data/apache-hive-2.3.7-bin/lib# ls -l
total 2304
-rw-r--r-- 1 root root 2356711 Dec  4  2019 mysql-connector-java-8.0.19.jar
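A sketch of step 2, assuming the 8.0.19 tarball downloaded above (the jar sits at the top level of the extracted directory):

tar xvf mysql-connector-java-8.0.19.tar.gz
cp mysql-connector-java-8.0.19/mysql-connector-java-8.0.19.jar $HIVE_HOME/lib/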

2.2 Configure hive-site.xml
Create hive-site.xml under $HIVE_HOME/conf; for more options see $HIVE_HOME/conf/hive-default.xml.template

<configuration>
        <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:mysql://192.168.31.231:1688/hive?createDatabaseIfNotExist=true</value>
                <description>JDBC connect string for a JDBC metastore</description>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>com.mysql.jdbc.Driver</value>
                <description>Driver class name for a JDBC metastore</description>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionUserName</name>
                <value>root</value>
                <description>username to use against metastore database</description>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionPassword</name>
                <value>123456</value>
                <description>password to use against metastore database</description>
        </property>
        <property>
                <name>datanucleus.autoCreateSchema</name>
                <value>true</value>
        </property>
        <property>
                <name>datanucleus.autoCreateTables</name>
                <value>true</value>
        </property>
        <property>
                <name>datanucleus.autoCreateColumns</name>
                <value>true</value>
        </property>
        <!-- Location of the Hive warehouse on HDFS -->
        <property>
                <name>hive.metastore.warehouse.dir</name>
                <value>/hive</value>
                <description>location of default database for the warehouse</description>
        </property>
        <!-- Temporary local directory for added resources -->
        <property>
                <name>hive.downloaded.resources.dir</name>
                <value>/tmp/hive/resources</value>
                <description>Temporary local directory for added resources in the remote file system.</description>
        </property>
        <!-- Before Hive 0.9, hive.exec.dynamic.partition had to be set to true explicitly; since 0.9 it defaults to true -->
        <property>
                <name>hive.exec.dynamic.partition</name>
                <value>true</value>
        </property>
        <property>
                <name>hive.exec.dynamic.partition.mode</name>
                <value>nonstrict</value>
        </property>


        <property>
                <name>hive.server2.thrift.bind.host</name>
                <value>127.0.0.1</value>
        </property>
        <property>
                <name>hive.server2.thrift.port</name>
                <value>10000</value>
        </property>
        <property>
                <name>hive.server2.thrift.http.port</name>
                <value>10001</value>
        </property>
        <property>
                <name>hive.server2.thrift.http.path</name>
                <value>cliservice</value>
        </property>
        <!-- HiveServer2 web UI -->
        <property>
                <name>hive.server2.webui.host</name>
                <value>127.0.0.1</value>
        </property>
        <property>
                <name>hive.server2.webui.port</name>
                <value>19238</value>
        </property>
        <property>
                <name>hive.scratch.dir.permission</name>
                <value>755</value>
        </property>

        <property>
                <name>hive.server2.enable.doAs</name>
                <value>false</value>
        </property>
        <!-- property>
                <name>hive.server2.authentication</name>
                <value>NOSASL</value>
        </property -->
        <property>
                <name>hive.auto.convert.join</name>
                <value>false</value>
        </property>
</configuration>
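With the driver jar and hive-site.xml in place, initialize the metastore schema against MySQL (the MySQL counterpart of the derby command above):

schematool -dbType mysql -initSchema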

Start Hive

To use the command-line client: [[hive]]
To run Hive as background services, start the metastore and HiveServer2: [[nohup hive --service metastore &]] and [[nohup hive --service hiveserver2 &]]

root@master:/# hive --service --help
Available Services: 
beeline cleardanglingscratchdir cli hbaseimport hbaseschematool help 
hiveburninclient hiveserver2 hplsql jar lineage llap llapdump llapstatus 
metastore metatool orcfiledump rcfilecat schemaTool version 
root@master:/data/apache-hive-2.3.7-bin/bin# hive
hive> show databases;
OK
default
Time taken: 3.171 seconds, Fetched: 1 row(s)
hive> create database abc;
OK
Time taken: 2.634 seconds
hive> show databases;
OK
abc
default
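Optionally verify HiveServer2 as well by connecting with beeline on the thrift port configured above (a sketch; authentication flags depend on your setup):

beeline -u jdbc:hive2://127.0.0.1:10000 -n root -e 'show databases;'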

Save the changes so far to the bigdata image

4b2705f587d8 is the container ID; yours may differ.
On the host, run docker commit 4b2705f587d8 jansora/bigdata:hadoop-hive

Switch containers

  1. After committing, exit the current container (press [[ctrl + d]] inside it)
  2. Start a new container: [[docker run -it --network bigdata --mount source=bigdata,target=/data --publish 19231:19231 --publish 19232:19232 --publish 19233:19233 --publish 19234:19234 --publish 19235:19235 --publish 19236:19236 --publish 19237:19237 --publish 19238:19238 --publish 19239:19239 --publish 19240:19240 --publish 19241:19241 jansora/bigdata:hadoop-hive]]

Install Spark

Download Spark

[[wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz]]

Configure environment variables

Add the following to .bashrc (the path matches the "without hadoop" build, spark-2.4.5-bin-without-hadoop; if you downloaded the hadoop2.7 build linked above, point SPARK_HOME at the directory that tarball extracts to instead):

export SPARK_HOME=/data/spark-2.4.5-bin-without-hadoop

Apply it: source ~/.bashrc

Configure the slave nodes

Edit $SPARK_HOME/conf/slaves and use the same host list as Hadoop's etc/hadoop/slaves above.
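Spark ships .template files for both configuration files used here; a minimal sketch (file names are those bundled with Spark 2.4.5):

cp $SPARK_HOME/conf/slaves.template $SPARK_HOME/conf/slaves
cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
# then put the same host list (master, slave1, slave2, ...) into conf/slaves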

Configure $SPARK_HOME/conf/spark-env.sh

export SPARK_DIST_CLASSPATH=$(/data/hadoop-2.10.0/bin/hadoop classpath)
export HADOOP_CONF_DIR=/data/hadoop-2.10.0/etc/hadoop
export SPARK_MASTER_IP=master
export SPARK_MASTER_WEBUI_PORT=19241

Start Spark

[[cd $SPARK_HOME/sbin && ./start-all.sh]]
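A quick smoke test against the standalone master started above (7077 is the default standalone master port; the web UI listens on the SPARK_MASTER_WEBUI_PORT configured above, 19241):

$SPARK_HOME/bin/spark-shell --master spark://master:7077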

Save the changes so far to the bigdata image

4b2705f587d8 is the container ID; yours may differ.
On the host, run docker commit 4b2705f587d8 jansora/bigdata:hadoop-hive-spark

Switch containers

  1. After committing, exit the current container (press [[ctrl + d]] inside it)
  2. Start a new container: [[docker run -it --network bigdata --mount source=bigdata,target=/data --publish 19231:19231 --publish 19232:19232 --publish 19233:19233 --publish 19234:19234 --publish 19235:19235 --publish 19236:19236 --publish 19237:19237 --publish 19238:19238 --publish 19239:19239 --publish 19240:19240 --publish 19241:19241 jansora/bigdata:hadoop-hive-spark]]

Install HBase

Download HBase

[[wget https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.2.5/hbase-2.2.5-bin.tar.gz]]

Configure environment variables

Add the following to .bashrc:

export HBASE_HOME=/data/hbase-2.2.5
export PATH=$HBASE_HOME/bin:$PATH

Configure hbase-site.xml

Add the following to the existing content to enable the web UI port (note: 19241 is also used above as the Spark master web UI port, so pick a different port if both run in the same container):

  <property>
          <name>hbase.master.info.port</name>
          <value>19241</value>
  </property>

Start HBase

[[start-hbase.sh]]
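A quick check that HBase came up, using standard hbase shell commands:

hbase shell
# inside the shell:
status
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'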

Remove the volume mount

  1. Copy the data inside the container: [[cp -r /data /data.bak]]
  2. Save the image: on the host, run docker commit 4b2705f587d8 jansora/bigdata:hadoop-hive-spark-hbase-with-dependencies
  3. Exit the container and start a new one from jansora/bigdata:hadoop-hive-spark-hbase-with-dependencies, this time without --mount
  4. Move the data back: [[mv /data.bak /data]]
  5. Save the image again: on the host, run docker commit 4b2705f587d8 jansora/bigdata:hadoop-hive-spark-hbase-with-dependencies (use the ID of the new container here)
  6. Push it to the registry: docker push jansora/bigdata:hadoop-hive-spark-hbase-with-dependencies
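A consolidated sketch of the sequence above (container IDs are examples; run the docker commands on the host and the file commands inside the containers):

# inside the old container, which was started with --mount:
cp -r /data /data.bak
# on the host:
docker commit 4b2705f587d8 jansora/bigdata:hadoop-hive-spark-hbase-with-dependencies
# start a new container from that image without --mount, then inside it:
mv /data.bak /data
# on the host: commit the final image (using the new container's ID) and push it
docker commit <new-container-id> jansora/bigdata:hadoop-hive-spark-hbase-with-dependencies
docker push jansora/bigdata:hadoop-hive-spark-hbase-with-dependencies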
