Shell prompt conventions used in the transcripts below:
Host --- (dp) jansora@jansora-PC:~/docker/bigdata$
Container --- root@4b2705f587d8:/#
Preparation
1. Install Docker
See https://www.jansora.com/post/install-docker
2. Prepare the base image: Ubuntu 20.04
Pull the Ubuntu image: [[docker pull ubuntu:20.04]]
3. Create a data volume
Run [[docker volume create bigdata]] to create the volume:
(dp) jansora@jansora-PC:~/docker/bigdata$ docker volume create bigdata
bigdata
(dp) jansora@jansora-PC:~/docker/bigdata$ docker volume inspect bigdata
[
    {
        "CreatedAt": "2020-06-05T09:54:28+08:00",
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/bigdata/_data",
        "Name": "bigdata",
        "Options": {},
        "Scope": "local"
    }
]
4. Create a network
[[docker network create -d bridge --subnet=192.168.0.0/24 --gateway=192.168.0.100 --ip-range=192.168.0.0/24 bigdata]]
Check it:
(dp) jansora@jansora-PC:~/docker/bigdata$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
6ba66f53ceba        bigdata             bridge              local
45592e17a68e        bridge              bridge              local
5. Start the container
Open another terminal and start the base container:
[[docker run -it --network bigdata --mount source=bigdata,target=/data ubuntu:20.04]]
Install the dependencies
Before installing anything, you can speed up the package mirror first (optional).
1. Configure the Tsinghua (TUNA) mirror (optional)
- On the host, create a file named sources.list under /var/lib/docker/volumes/bigdata/_data and write the following into it:
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-updates main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-backports main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-security main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-security main restricted universe multiverse
# Pre-release sources; enabling them is not recommended
# deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-proposed main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-proposed main restricted universe multiverse
- Install the prerequisites (the mirror URLs use HTTPS, which needs ca-certificates):
apt update && apt install ca-certificates -y
- In the container terminal, run [[cp /data/sources.list /etc/apt/]]
- Refresh the package lists: [[apt update]]
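The sources.list above follows one repeating pattern, so it can also be generated by a short script. A minimal sketch; it writes sources.list into the current directory (copying it into the volume path still needs sudo):

```shell
# Generate a TUNA sources.list for Ubuntu 20.04 (focal) in the current
# directory; each suite gets one "deb" line with the same components.
mirror="https://mirrors.tuna.tsinghua.edu.cn/ubuntu/"
components="main restricted universe multiverse"
for suite in focal focal-updates focal-backports focal-security; do
    echo "deb $mirror $suite $components"
done > sources.list
cat sources.list
```

The deb-src lines from the listing above are omitted here since they are commented out anyway.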
Install the dependency packages
In the container terminal, run [[apt install vim net-tools lrzsz]]
Configure JDK 8
1. In the container terminal, run [[apt install openjdk-8-jdk -y]]
root@4b2705f587d8:/# java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1ubuntu1-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
2. Configure JAVA_HOME. In the container terminal, open [[vim $HOME/.bashrc]] and append at the bottom: export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
3. Reload and verify: [[source $HOME/.bashrc && echo $JAVA_HOME]]
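Steps 2-3 can also be scripted. In this sketch ./bashrc.demo stands in for $HOME/.bashrc so the sketch leaves the real shell configuration untouched:

```shell
# Append JAVA_HOME to a shell rc file, source it, and verify.
# ./bashrc.demo is a stand-in for $HOME/.bashrc.
rc=./bashrc.demo
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/' >> "$rc"
. "$rc"
echo "JAVA_HOME=$JAVA_HOME"
```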
Configure SSH
Installing ssh pulls in tzdata, which prompts for a time zone; enter Asia and then Shanghai.
1. Install: [[apt install ssh]]
2. Start the service: [[service ssh start]]
3. Set up passwordless SSH to localhost:
3.1 [[ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa]]
3.2 [[cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys]]
3.3 [[chmod 0600 ~/.ssh/authorized_keys]]
3.4 Verify: [[ssh localhost]]
Save the changes so far to the bigdata image
4b2705f587d8 is the container ID; yours may differ.
On the host, run docker commit 4b2705f587d8 jansora/bigdata:base
Swap in a new container
- Once the commit is done, exit the current container (press [[Ctrl + D]] inside it)
- Start a new container from the committed image: [[docker run -it --network bigdata --mount source=bigdata,target=/data jansora/bigdata:base]]
Configure Hadoop
Download and install Hadoop
- On the host, download hadoop-2.10.0:
[[wget https://mirrors.bfsu.edu.cn/apache/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz]]
- Extract it: tar xvf hadoop-2.10.0.tar.gz
- Move it into the data volume: [[sudo mv hadoop-2.10.0 /var/lib/docker/volumes/bigdata/_data]]
- In the container, configure HADOOP_HOME: open [[vim $HOME/.bashrc]] and append at the bottom: export HADOOP_HOME=/data/hadoop-2.10.0
- Configure the start scripts so the daemons can run as root
Append to start-dfs.sh:
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
Append to start-yarn.sh:
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
Configure etc/hadoop/hadoop-env.sh by appending:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
Configure etc/hadoop/core-site.xml:
First create the temp directory: [[mkdir -p ${HADOOP_HOME}/data/temp]]
Note: ${...} inside *-site.xml expands Hadoop configuration properties and Java system properties, not shell variables, so hadoop.tmp.dir below may need the absolute path (e.g. /data/hadoop-2.10.0/data/temp) written out.
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:19230</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>${HADOOP_HOME}/data/temp</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>4096</value>
    </property>
    <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
    </property>
</configuration>
Configure etc/hadoop/mapred-site.xml:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.job.ubertask.enable</name>
        <value>true</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:19238</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19239</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
Configure etc/hadoop/yarn-site.xml:
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:18040</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:18030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:18025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:18141</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:18088</value>
    </property>
</configuration>
Configure etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:19231</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>master:19232</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///hdfs/data/namenodeDatas</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///hdfs/data/datanodeDatas</value>
    </property>
    <property>
        <name>dfs.namenode.edits.dir</name>
        <value>file:///hdfs/data/dfs/nn/edits</value>
    </property>
    <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>file:///hdfs/data/dfs/snn/name</value>
    </property>
    <property>
        <name>dfs.namenode.checkpoint.edits.dir</name>
        <value>file:///hdfs/data/dfs/nn/snn/edits</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <!-- Block size (134217728 bytes = 128 MB) -->
    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>
    </property>
    <!-- Number of NameNode handler threads -->
    <property>
        <name>dfs.namenode.handler.count</name>
        <value>100</value>
    </property>
    <!-- Number of replicas per block -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>
Configure etc/hadoop/slaves
This file defines the cluster: each line names one container, so every container must be started with a matching --name flag.
Start the master with docker run -it --name master --......
Start worker 1 with docker run -it --name slave1 --......
Start worker 2 with docker run -it --name slave2 --......
Start as many workers as you need; the names must match the entries in slaves:
master
slave1
slave2
slave3
......
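If the cluster grows, the slaves file can be generated rather than written by hand. A minimal sketch in plain POSIX shell (the worker count of 3 is arbitrary):

```shell
# Generate the slaves file: the master line plus N workers named slave1..slaveN.
workers=3   # arbitrary worker count for this sketch
{
    echo master
    i=1
    while [ "$i" -le "$workers" ]; do
        echo "slave$i"
        i=$((i + 1))
    done
} > slaves
cat slaves
```

The same names then feed the corresponding docker run --name flags.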
Start the cluster
./start-all.sh (found under $HADOOP_HOME/sbin)
Verify the installation
root@master:/data/hadoop-2.10.0/etc/hadoop# hadoop fs -mkdir /input
root@master:/data/hadoop-2.10.0/etc/hadoop# echo 'test' > test.txt
root@master:/data/hadoop-2.10.0/etc/hadoop# hadoop fs -put test.txt /input/
root@master:/data/hadoop-2.10.0/etc/hadoop# hadoop fs -ls /input
Found 1 items
-rw-r--r-- 3 root supergroup 5 2020-06-10 15:20 /input/test.txt
root@master:/data/hadoop-2.10.0/etc/hadoop# hadoop fs -cat /input/test.txt
test
Save the changes so far to the bigdata image
4b2705f587d8 is the container ID; yours may differ.
On the host, run docker commit 4b2705f587d8 jansora/bigdata:base-hadoop
Swap in a new container
- Once the commit is done, exit the current container (press [[Ctrl + D]] inside it)
- Start a new container, this time publishing the service ports: [[docker run -it --network bigdata --mount source=bigdata,target=/data --publish 19231:19231 --publish 19232:19232 --publish 19233:19233 --publish 19234:19234 --publish 19235:19235 --publish 19236:19236 --publish 19237:19237 --publish 19238:19238 --publish 19239:19239 --publish 19240:19240 --publish 19241:19241 jansora/bigdata:base-hadoop]]
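The long run of --publish flags maps the contiguous port range 19231-19241 one-to-one, so it can be generated instead of typed out. A plain POSIX shell sketch:

```shell
# Build the repeated --publish flags for the port range 19231-19241.
flags=""
p=19231
while [ "$p" -le 19241 ]; do
    flags="$flags --publish $p:$p"
    p=$((p + 1))
done
echo "docker run -it --network bigdata --mount source=bigdata,target=/data$flags jansora/bigdata"
```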
Troubleshooting
If the Secondary NameNode web page shows no information, see:
https://blog.csdn.net/u012834750/article/details/80508464
Configure Hive
Download Hive
[[wget https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-2.3.7/apache-hive-2.3.7-bin.tar.gz]]
Configure environment variables
The Hadoop settings stay the same as above.
Edit the environment: [[vim ~/.bashrc]]
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HADOOP_HOME=/data/hadoop-2.10.0
export HIVE_HOME=/data/apache-hive-2.3.7-bin
export PATH=$HADOOP_HOME/bin:$PATH
export PATH=$HADOOP_HOME/sbin:$PATH
export PATH=$HIVE_HOME/bin:$PATH
Also remember to start SSH in the fresh container: service ssh start
Configure the metastore
- Option 1: initialize with the embedded Derby database
[[schematool -dbType derby -initSchema]]
Derby keeps the metadata in a metastore_db directory under whichever directory the command is run from, so you must start hive from that same directory later.
- Option 2: initialize with MySQL
Configure the MySQL driver
This needs an external MySQL instance; installing MySQL is not covered here.
Assume MySQL listens on port 1688 with user root, password 123456, and remote access enabled.
2.1 Download the MySQL driver
Pick a driver from https://downloads.mysql.com/archives/c-j/ ; the latest version is fine
- [[wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-8.0.19.tar.gz]]
- Extract it and put the jar into $HIVE_HOME/lib:
root@master:/data/apache-hive-2.3.7-bin/lib# ls -l
total 2304
-rw-r--r-- 1 root root 2356711 Dec 4 2019 mysql-connector-java-8.0.19.jar
2.2 Configure hive-site.xml
Create a new hive-site.xml under $HIVE_HOME/conf; see $HIVE_HOME/conf/hive-default.xml.template for the full list of options.
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.31.231:1688/hive?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <!-- Connector/J 8.x class name; the legacy com.mysql.jdbc.Driver still works but is deprecated -->
        <value>com.mysql.cj.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>password to use against metastore database</description>
    </property>
    <property>
        <name>datanucleus.autoCreateSchema</name>
        <value>true</value>
    </property>
    <property>
        <name>datanucleus.autoCreateTables</name>
        <value>true</value>
    </property>
    <property>
        <name>datanucleus.autoCreateColumns</name>
        <value>true</value>
    </property>
    <!-- Location of the Hive warehouse on HDFS -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/hive</value>
        <description>location of default database for the warehouse</description>
    </property>
    <!-- Local directory for temporary resource files -->
    <property>
        <name>hive.downloaded.resources.dir</name>
        <value>/tmp/hive/resources</value>
        <description>Temporary local directory for added resources in the remote file system.</description>
    </property>
    <!-- hive.exec.dynamic.partition had to be enabled manually before Hive 0.9; it defaults to true since 0.9 -->
    <property>
        <name>hive.exec.dynamic.partition</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.exec.dynamic.partition.mode</name>
        <value>nonstrict</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>127.0.0.1</value>
    </property>
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <property>
        <name>hive.server2.thrift.http.port</name>
        <value>10001</value>
    </property>
    <property>
        <name>hive.server2.thrift.http.path</name>
        <value>cliservice</value>
    </property>
    <!-- HiveServer2 web UI; note that 19238 also appears above as mapreduce.jobhistory.address, so change one of them if both services run on master -->
    <property>
        <name>hive.server2.webui.host</name>
        <value>127.0.0.1</value>
    </property>
    <property>
        <name>hive.server2.webui.port</name>
        <value>19238</value>
    </property>
    <property>
        <name>hive.scratch.dir.permission</name>
        <value>755</value>
    </property>
    <property>
        <name>hive.server2.enable.doAs</name>
        <value>false</value>
    </property>
    <!-- property>
        <name>hive.server2.authentication</name>
        <value>NOSASL</value>
    </property -->
    <property>
        <name>hive.auto.convert.join</name>
        <value>false</value>
    </property>
</configuration>
Start Hive
If you initialized the metastore with MySQL, run [[schematool -dbType mysql -initSchema]] once beforehand.
Interactive command line: [[hive]]
As background services: [[nohup hive --service metastore &]] and [[nohup hive --service hiveserver2 &]]
root@master:/# hive --service --help
Available Services:
beeline cleardanglingscratchdir cli hbaseimport hbaseschematool help
hiveburninclient hiveserver2 hplsql jar lineage llap llapdump llapstatus
metastore metatool orcfiledump rcfilecat schemaTool version
root@master:/data/apache-hive-2.3.7-bin/bin# hive
hive> show databases;
OK
default
Time taken: 3.171 seconds, Fetched: 1 row(s)
hive> create database abc;
OK
Time taken: 2.634 seconds
hive> show databases;
OK
abc
default
Save the changes so far to the bigdata image
4b2705f587d8 is the container ID; yours may differ.
On the host, run docker commit 4b2705f587d8 jansora/bigdata:hadoop-hive
Swap in a new container
- Once the commit is done, exit the current container (press [[Ctrl + D]] inside it)
- Start a new container: [[docker run -it --network bigdata --mount source=bigdata,target=/data --publish 19231:19231 --publish 19232:19232 --publish 19233:19233 --publish 19234:19234 --publish 19235:19235 --publish 19236:19236 --publish 19237:19237 --publish 19238:19238 --publish 19239:19239 --publish 19240:19240 --publish 19241:19241 jansora/bigdata:hadoop-hive]]
Install Spark
Download Spark
spark-env.sh below builds Spark's classpath from the local Hadoop install via SPARK_DIST_CLASSPATH, so use the "without-hadoop" build:
[[wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.5/spark-2.4.5-bin-without-hadoop.tgz]]
Configure environment variables
Add to .bashrc:
export SPARK_HOME=/data/spark-2.4.5-bin-without-hadoop
Apply it: source ~/.bashrc
Configure the slave nodes
Create $SPARK_HOME/conf/slaves with the same node list as Hadoop's etc/hadoop/slaves above.
Configure $SPARK_HOME/conf/spark-env.sh:
export SPARK_DIST_CLASSPATH=$(/data/hadoop-2.10.0/bin/hadoop classpath)
export HADOOP_CONF_DIR=/data/hadoop-2.10.0/etc/hadoop
export SPARK_MASTER_IP=master
export SPARK_MASTER_WEBUI_PORT=19241
Start Spark
[[cd $SPARK_HOME/sbin && ./start-all.sh]]
Save the changes so far to the bigdata image
4b2705f587d8 is the container ID; yours may differ.
On the host, run docker commit 4b2705f587d8 jansora/bigdata:hadoop-hive-spark
Swap in a new container
- Once the commit is done, exit the current container (press [[Ctrl + D]] inside it)
- Start a new container: [[docker run -it --network bigdata --mount source=bigdata,target=/data --publish 19231:19231 --publish 19232:19232 --publish 19233:19233 --publish 19234:19234 --publish 19235:19235 --publish 19236:19236 --publish 19237:19237 --publish 19238:19238 --publish 19239:19239 --publish 19240:19240 --publish 19241:19241 jansora/bigdata:hadoop-hive-spark]]
Install HBase
Download HBase
[[wget https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.2.5/hbase-2.2.5-bin.tar.gz]]
Configure environment variables
Add to .bashrc:
export HBASE_HOME=/data/hbase-2.2.5
export PATH=$HBASE_HOME/bin:$PATH
Configure hbase-site.xml
Add the following to the existing content to expose the web UI (note that 19241 is also used above as SPARK_MASTER_WEBUI_PORT; change one of them if the Spark and HBase masters share a host):
<property>
    <name>hbase.master.info.port</name>
    <value>19241</value>
</property>
Start HBase
[[start-hbase.sh]]
Baking the data into the image (removing the volume mount)
docker commit does not capture the contents of a mounted volume, so /data must first be copied into the container's own filesystem:
- Copy the data: [[cp -r /data /data.bak]]
- Save the image: on the host, run docker commit 4b2705f587d8 jansora/bigdata:hadoop-hive-spark-hbase-with-dependencies
- Exit the container, then start a new one from jansora/bigdata:hadoop-hive-spark-hbase-with-dependencies, this time without the --mount flag
- Move the data back into place: [[mv /data.bak /data]]
- Save the image again: on the host, run docker commit against the new container's ID: docker commit <new-container-id> jansora/bigdata:hadoop-hive-spark-hbase-with-dependencies
- Push it to the registry: docker push jansora/bigdata:hadoop-hive-spark-hbase-with-dependencies