Hadoop – Big Data in a Small Lab

Bom dia a todos,

Desde o ultimo post sobre ElasticSearch, chegou a a altura de fazer um post sobre um tema que tem vindo a ser pedido repetidamente: Hadoop.

É uma solução/framework de que permite o armazenamento de Big Data num ambiente distribuído, de forma a que possa ser processado paralelamente.
Com o selo de qualidade da Apache Foundation, é algo que está em high demand no mercado nacional e internacional.

Sei que existem colegas interessados na tecnologia, pela buzzword, em testar o que é, para que serve, como funciona e como poderei o integrar nos TI/SI da minha empresa.

Em primeiro caso é importante saber o que é, e o que não é o Hadoop:

O Apache Hadoop é uma framework de software que permite correr processos distribuídos em grandes datasets, através de farms computacionais usando apenas modelos de programação simples.

É pensado e desenhado para escalar de um ou dois servidores para milhares de máquinas, conseguindo assim utilizar os recursos computacionais e de storage.

Existem casos específicos onde o Hadoop brilha, por exemplo na escalabilidade (hoje termos uma BigData de 2TB e daqui a 10 anos de 2000PB sem alteração do paradigma), frameworks extremamente diversificadas (mongoDB, spark, mahout, etc) e disponibilidade de dados pela sua natureza distribuída.

Contudo existem downsides:

O Hadoop por ter no seu core uma framework de batch não poderá ser utilizada para obter dados em tempo real (ou near real time),  é muito susceptível de sofrer lentidão em casos em que a informação se encontre fragmentada (por exemplo 1024 ficheiros de 1TB versus 1 ficheiro de 1PB), e não é de forma alguma um substituto para a tradicional data warehouse.


Igualmente em casos onde a segurança dos dados seja importante, existem componentes extras como o Apache Accumulo que podem ser integradas em Hadoop de forma a garantir um nível desejado de criptografia e segurança.

Existe um artigo muito completo sobre o tema e os use cases que pode ser consultado aqui.

No final do dia, o Hadoop serve para armazenar, e depurar dados através de Map Reduce até se conseguirem dados úteis, desde instancias de big data, e apresentar os resultados de forma concisa, e simples.

Faz parte de um ecosistemas aplicacional que envolve desde a tecnologia de map reduce, managment, coordination, scheduling, noSQL DB, scripting, machine learning, etc.

As componentes principais do Hadoop são o Yarn (batch processing) e o HDFS (storage). Este post é referente ao componente de storage. Se quiserem ler mais sobre a componente do Yarn (batch) a IBM tem um artigo muito profundo aqui.

E agora, depois de uma introdução anormalmente grande, iremos implementar o nosso lab 🙂

Para o nosso caso iremos ter dois slave nodes (que processam) e um master (coordinator):

lxcHadoopMaster   RUNNING  10.0.1.50           -     NO         
lxcHadoopSlave01  RUNNING  10.0.1.51           -     NO         
lxcHadoopSlave02  RUNNING  10.0.1.52           -     NO

Em primeiro lugar será necessário garantir os pré requisitos necessários que são os seguintes:

Instalar java. As versões suportadas podem ser consultadas aqui.

# curl -LO -H "Cookie: oraclelicense=accept-securebackup-cookie" “http://download.oracle.com/otn-pub/java/jdk/8u162-b14/jdk-8u162-linux-x64.rpm”
# yum localinstall jdk-8u162-linux-x64.rpm
# java -version

java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)

Criar um utilizador aplicacional onde a solução será executada.

# useradd hadoop
# passwd hadoop
Changing password for user hadoop.
New password: 
Retype new password: 
passwd: all authentication tokens updated successfully.

Adicionar FQDN mapping (ou configurar DNS para o efeito).

# vi /etc/hosts
10.0.1.50 lxchadoopmaster
10.0.1.51 lxchadoopslave01
10.0.1.52 lxchadoopslave02

Gerar chaves privadas para gerir os scripts remotos de Hadoop.
Nota: embora não seja necessário para os nós comunicarem uns com os outros, irá ser de grande utilidade na gestão do nosso cluster.

# su - hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -I ~/.ssh/id_rsa.pub hadoop@lxchadoopmaster
$ ssh-copy-id -I ~/.ssh/id_rsa.pub hadoop@lxchadoopslave01
$ ssh-copy-id -I ~/.ssh/id_rsa.pub hadoop@lxchadoopslave02
$ chmod 0600 ~/.ssh/authorized_keys

Adicionar software de sistema que irá ser necessário mais a frente:

# yum -y install wget rsync which

Descarregar o software aplicacional em si e instalar ele na sua diretoria:

mkdir -p /opt/hadoop
chown hadoop:hadoop /opt/hadoop
su - hadoop
cd /opt/hadoop
wget http://www-us.apache.org/dist/hadoop/common/stable/hadoop-2.9.0.tar.gz
tar xzf hadoop-2.9.0.tar.gz

Em seguida será necessário configurar as variáveis de ambiente para o utilizador aplicacional que irá executar a solução:

export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Recomendo que as variáveis sejam carregadas no bashrc ou no profile do utilizador.
Em seguida, editar o ficheiro $HADOOP_HOME/etc/hadoop/hadoop-env.sh e o JAVA_HOME conforme configurado nos vossos sistemas:

export JAVA_HOME=/usr/java/jdk1.8.0_162/jre/
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Finalmente chegou a altura de configurar o nosso Hadoop.

cd $HADOOP_HOME/etc/hadoop

Editar o ficheiro core-site.xml

# vi core-site.xml
# Adicionar dentro da tag configuration
<property>
    <name>fs.default.name</name>
    <value>hdfs://lxchadoopmaster:9000/</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

Editar o ficheiro hdfs-site.xml

# vi hdfs-site.xml
# Adicionar dentro da tag configuration
<property>
	<name>dfs.data.dir</name>
	<value>/opt/hadoop/dfs/name/data</value>
	<final>true</final>
</property>
<property>
	<name>dfs.name.dir</name>
	<value>/opt/hadoop/dfs/name</value>
	<final>true</final>
</property>
<property>
	<name>dfs.replication</name>
	<value>1</value>
</property>

Editar (ou criar o ficheiro – depende da versão de Hadoop que estejam a instalar) mapred-site.xml

# vi mapred-site.xml
# Adicionar dentro da tag configuration
<property>
        <name>mapred.job.tracker</name>
	<value>lxchadoopslave01:9001</value>
</property>
<property> 
        <name>mapred.job.tracker</name> 
        <value>lxchadoopslave02:9001</value> 
</property>

Enviar a nossa configuração para todos os nós da solução:

# su - hadoop
$ rsync -auvx $HADOOP_HOME lxchadoopslave01:/opt/
$ rsync -auvx $HADOOP_HOME lxchadoopslave02:/opt/

Configurações especificas do Master Node:

# su - hadoop
$ cd $HADOOP_HOME/etc/hadoop
$ vi slaves
lxchadoopslave1
lxchadoopslave2

Em seguida inicializar o name node no master:

# su - hadoop
$hadoop namenode -format

O resultado será algo como:

[hadoop@lxcHadoopMaster ~]$ hadoop namenode -format

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

18/02/20 12:32:57 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************

STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = lxchadoopmaster/10.0.1.50
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.9.0

STARTUP_MSG:   classpath = /opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/nimbus-jose-jwt-3.9.jar:/opt/hadoop/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/opt/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar:/opt/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-net-3.1.jar:/opt/hadoop/share/hadoop/common/lib/jersey-core-1.9.jar:/opt/hadoop/share/hadoop/common/lib/guava-11.0.2.jar:/opt/hadoop/share/hadoop/common/lib/gson-2.2.4.jar:/opt/hadoop/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/opt/hadoop/share/hadoop/common/lib/log4j-1.2.17.jar:/opt/hadoop/share/hadoop/common/lib/woodstox-core-5.0.3.jar:/opt/hadoop/share/hadoop/common/lib/mockito-all-1.8.5.jar:/opt/hadoop/share/hadoop/common/lib/jettison-1.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-digester-1.8.jar:/opt/hadoop/share/hadoop/common/lib/stax2-api-3.1.4.jar:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar:/opt/hadoop/share/hadoop/common/lib/xz-1.0.jar:/opt/hadoop/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/opt/hadoop/share/hadoop/common/lib/hadoop-auth-2.9.0.jar:/opt/hadoop/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/opt/hadoop/share/hadoop/common/lib/stax-api-1.0-2.jar:/opt/hadoop/share/hadoop/common/lib/jetty-6.1.26.jar:/opt/hadoop/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/opt/hadoop/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/opt/hadoop/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-io-2.4.jar:/opt/hadoop/share/hadoop/common/lib/curator-framework-2.7.1.jar:/opt/hadoop/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/opt/hadoop/share/hadoop/common/lib/json-smart-1.1.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-lang-2.6.jar:/opt/hadoop/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/opt/hadoop/share/hadoop/common/lib/commons-collections-3.2.2.jar:/opt/hadoop/share/hadoop/common/lib/jersey-server-1.9.jar:/opt/hadoop/share/hadoop/common/lib/jets3t-0.9.0.jar:/opt/hadoop/share/hadoop/common/lib/xmlenc-0.52.jar:/opt/hadoop/share/hadoop/common/lib/hamcrest-core-1.3.jar:/opt/hadoop/share/hadoop/common/lib/servlet-api-2.5.jar:/opt/hadoop/share/hadoop/common/lib/commons-logging-1.1.3.jar:/opt/hadoop/share/hadoop/common/lib/commons-codec-1.4.jar:/opt/hadoop/share/hadoop/common/lib/paranamer-2.3.jar:/opt/hadoop/share/hadoop/common/lib/jcip-annotations-1.0.jar:/opt/hadoop/share/hadoop/common/lib/zookeeper-3.4.6.jar:/opt/hadoop/share/hadoop/common/lib/activation-1.1.jar:/opt/hadoop/share/hadoop/common/lib/jsch-0.1.54.jar:/opt/hadoop/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/opt/hadoop/share/hadoop/common/lib/jersey-json-1.9.jar:/opt/hadoop/share/hadoop/common/lib/htrace-core4-4.1.0-incubating.jar:/opt/hadoop/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/opt/hadoop/share/hadoop/common/lib/jetty-util-6.1.26.jar:/opt/hadoop/share/hadoop/common/lib/commons-math3-3.1.1.jar:/opt/hadoop/share/hadoop/common/lib/junit-4.11.jar:/opt/hadoop/share/hadoop/common/lib/hadoop-annotations-2.9.0.jar:/opt/hadoop/share/hadoop/common/lib/slf4j-api-1.7.25.jar:/opt/hadoop/share/hadoop/common/lib/httpcore-4.4.4.jar:/opt/hadoop/share/hadoop/common/lib/commons-compress-1.4.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-lang3-3.4.jar:/opt/hadoop/share/hadoop/common/lib/asm-3.2.jar:/opt/hadoop/share/hadoop/common/lib/jetty-sslengine-6.1.26.jar:/opt/hadoop/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/opt/hadoop/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/opt/hadoop/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/opt/hadoop/share/hadoop/common/lib/avro-1.7.7.jar:/opt/hadoop/share/hadoop/common/lib/httpclient-4.5.2.jar:/opt/hadoop/share/hadoop/common/lib/jsp-api-2.1.jar:/opt/hadoop/share/hadoop/common/lib/curator-client-2.7.1.jar:/opt/hadoop/share/hadoop/common/lib/jsr305-3.0.0.jar:/opt/hadoop/share/hadoop/common/lib/netty-3.6.2.Final.jar:/opt/hadoop/share/hadoop/common/lib/snappy-java-1.0.5.jar:/opt/hadoop/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/opt/hadoop/share/hadoop/common/hadoop-common-2.9.0.jar:/opt/hadoop/share/hadoop/common/hadoop-nfs-2.9.0.jar:/opt/hadoop/share/hadoop/common/hadoop-common-2.9.0-tests.jar:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/okhttp-2.4.0.jar:/opt/hadoop/share/hadoop/hdfs/lib/commons-cli-1.2.jar:/opt/hadoop/share/hadoop/hdfs/lib/jersey-core-1.9.jar:/opt/hadoop/share/hadoop/hdfs/lib/guava-11.0.2.jar:/opt/hadoop/share/hadoop/hdfs/lib/xml-apis-1.3.04.jar:/opt/hadoop/share/hadoop/hdfs/lib/jackson-core-2.7.8.jar:/opt/hadoop/share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/opt/hadoop/share/hadoop/hdfs/lib/log4j-1.2.17.jar:/opt/hadoop/share/hadoop/hdfs/lib/netty-all-4.0.23.Final.jar:/opt/hadoop/share/hadoop/hdfs/lib/jackson-annotations-2.7.8.jar:/opt/hadoop/share/hadoop/hdfs/lib/leveldbjni-all-1.8.jar:/opt/hadoop/share/hadoop/hdfs/lib/jetty-6.1.26.jar:/opt/hadoop/share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/opt/hadoop/share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/opt/hadoop/share/hadoop/hdfs/lib/commons-io-2.4.jar:/opt/hadoop/share/hadoop/hdfs/lib/commons-lang-2.6.jar:/opt/hadoop/share/hadoop/hdfs/lib/jersey-server-1.9.jar:/opt/hadoop/share/hadoop/hdfs/lib/xmlenc-0.52.jar:/opt/hadoop/share/hadoop/hdfs/lib/servlet-api-2.5.jar:/opt/hadoop/share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/opt/hadoop/share/hadoop/hdfs/lib/commons-codec-1.4.jar:/opt/hadoop/share/hadoop/hdfs/lib/jackson-databind-2.7.8.jar:/opt/hadoop/share/hadoop/hdfs/lib/okio-1.4.0.jar:/opt/hadoop/share/hadoop/hdfs/lib/htrace-core4-4.1.0-incubating.jar:/opt/hadoop/share/hadoop/hdfs/lib/hadoop-hdfs-client-2.9.0.jar:/opt/hadoop/share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/opt/hadoop/share/hadoop/hdfs/lib/jetty-util-6.1.26.jar:/opt/hadoop/share/hadoop/hdfs/lib/xercesImpl-2.9.1.jar:/opt/hadoop/share/hadoop/hdfs/lib/asm-3.2.jar:/opt/hadoop/share/hadoop/hdfs/lib/jsr305-3.0.0.jar:/opt/hadoop/share/hadoop/hdfs/lib/netty-3.6.2.Final.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-client-2.9.0-tests.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.9.0.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-nfs-2.9.0.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-native-client-2.9.0-tests.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-client-2.9.0.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.9.0-tests.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-native-client-2.9.0.jar:/opt/hadoop/share/hadoop/yarn:/opt/hadoop/share/hadoop/yarn/lib/nimbus-jose-jwt-3.9.jar:/opt/hadoop/share/hadoop/yarn/lib/javassist-3.18.1-GA.jar:/opt/hadoop/share/hadoop/yarn/lib/java-xmlbuilder-0.4.jar:/opt/hadoop/share/hadoop/yarn/lib/guice-servlet-3.0.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-configuration-1.6.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-cli-1.2.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-net-3.1.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-core-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/guava-11.0.2.jar:/opt/hadoop/share/hadoop/yarn/lib/gson-2.2.4.jar:/opt/hadoop/share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar:/opt/hadoop/share/hadoop/yarn/lib/log4j-1.2.17.jar:/opt/hadoop/share/hadoop/yarn/lib/woodstox-core-5.0.3.jar:/opt/hadoop/share/hadoop/yarn/lib/jettison-1.1.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-digester-1.8.jar:/opt/hadoop/share/hadoop/yarn/lib/HikariCP-java7-2.4.12.jar:/opt/hadoop/share/hadoop/yarn/lib/stax2-api-3.1.4.jar:/opt/hadoop/share/hadoop/yarn/lib/xz-1.0.jar:/opt/hadoop/share/hadoop/yarn/lib/curator-test-2.7.1.jar:/opt/hadoop/share/hadoop/yarn/lib/jackson-jaxrs-1.9.13.jar:/opt/hadoop/share/hadoop/yarn/lib/geronimo-jcache_1.0_spec-1.0-alpha-1.jar:/opt/hadoop/share/hadoop/yarn/lib/leveldbjni-all-1.8.jar:/opt/hadoop/share/hadoop/yarn/lib/apacheds-i18n-2.0.0-M15.jar:/opt/hadoop/share/hadoop/yarn/lib/stax-api-1.0-2.jar:/opt/hadoop/share/hadoop/yarn/lib/jetty-6.1.26.jar:/opt/hadoop/share/hadoop/yarn/lib/api-asn1-api-1.0.0-M20.jar:/opt/hadoop/share/hadoop/yarn/lib/protobuf-java-2.5.0.jar:/opt/hadoop/share/hadoop/yarn/lib/curator-recipes-2.7.1.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-client-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-io-2.4.jar:/opt/hadoop/share/hadoop/yarn/lib/curator-framework-2.7.1.jar:/opt/hadoop/share/hadoop/yarn/lib/api-util-1.0.0-M20.jar:/opt/hadoop/share/hadoop/yarn/lib/json-smart-1.1.1.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-math-2.2.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-lang-2.6.jar:/opt/hadoop/share/hadoop/yarn/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-collections-3.2.2.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-server-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/ehcache-3.3.1.jar:/opt/hadoop/share/hadoop/yarn/lib/jets3t-0.9.0.jar:/opt/hadoop/share/hadoop/yarn/lib/xmlenc-0.52.jar:/opt/hadoop/share/hadoop/yarn/lib/servlet-api-2.5.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-logging-1.1.3.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-codec-1.4.jar:/opt/hadoop/share/hadoop/yarn/lib/paranamer-2.3.jar:/opt/hadoop/share/hadoop/yarn/lib/java-util-1.9.0.jar:/opt/hadoop/share/hadoop/yarn/lib/javax.inject-1.jar:/opt/hadoop/share/hadoop/yarn/lib/zookeeper-3.4.6-tests.jar:/opt/hadoop/share/hadoop/yarn/lib/jcip-annotations-1.0.jar:/opt/hadoop/share/hadoop/yarn/lib/zookeeper-3.4.6.jar:/opt/hadoop/share/hadoop/yarn/lib/activation-1.1.jar:/opt/hadoop/share/hadoop/yarn/lib/json-io-2.5.1.jar:/opt/hadoop/share/hadoop/yarn/lib/jsch-0.1.54.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-guice-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-beanutils-core-1.8.0.jar:/opt/hadoop/share/hadoop/yarn/lib/mssql-jdbc-6.2.1.jre7.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-json-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/htrace-core4-4.1.0-incubating.jar:/opt/hadoop/share/hadoop/yarn/lib/guice-3.0.jar:/opt/hadoop/share/hadoop/yarn/lib/jackson-mapper-asl-1.9.13.jar:/opt/hadoop/share/hadoop/yarn/lib/jetty-util-6.1.26.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-math3-3.1.1.jar:/opt/hadoop/share/hadoop/yarn/lib/metrics-core-3.0.1.jar:/opt/hadoop/share/hadoop/yarn/lib/httpcore-4.4.4.jar:/opt/hadoop/share/hadoop/yarn/lib/fst-2.50.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-compress-1.4.1.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-lang3-3.4.jar:/opt/hadoop/share/hadoop/yarn/lib/asm-3.2.jar:/opt/hadoop/share/hadoop/yarn/lib/jetty-sslengine-6.1.26.jar:/opt/hadoop/share/hadoop/yarn/lib/jaxb-api-2.2.2.jar:/opt/hadoop/share/hadoop/yarn/lib/jaxb-impl-2.2.3-1.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-beanutils-1.7.0.jar:/opt/hadoop/share/hadoop/yarn/lib/avro-1.7.7.jar:/opt/hadoop/share/hadoop/yarn/lib/httpclient-4.5.2.jar:/opt/hadoop/share/hadoop/yarn/lib/jsp-api-2.1.jar:/opt/hadoop/share/hadoop/yarn/lib/curator-client-2.7.1.jar:/opt/hadoop/share/hadoop/yarn/lib/jsr305-3.0.0.jar:/opt/hadoop/share/hadoop/yarn/lib/netty-3.6.2.Final.jar:/opt/hadoop/share/hadoop/yarn/lib/snappy-java-1.0.5.jar:/opt/hadoop/share/hadoop/yarn/lib/jackson-xc-1.9.13.jar:/opt/hadoop/share/hadoop/yarn/lib/aopalliance-1.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-tests-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-api-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-sharedcachemanager-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-registry-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-common-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-timeline-pluginstorage-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-applicationhistoryservice-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-router-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-common-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-client-2.9.0.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.9.0.jar:/opt/hadoop/share/hadoop/mapreduce/lib/guice-servlet-3.0.jar:/opt/hadoop/share/hadoop/mapreduce/lib/jersey-core-1.9.jar:/opt/hadoop/share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar:/opt/hadoop/share/hadoop/mapreduce/lib/log4j-1.2.17.jar:/opt/hadoop/share/hadoop/mapreduce/lib/xz-1.0.jar:/opt/hadoop/share/hadoop/mapreduce/lib/leveldbjni-all-1.8.jar:/opt/hadoop/share/hadoop/mapreduce/lib/protobuf-java-2.5.0.jar:/opt/hadoop/share/hadoop/mapreduce/lib/commons-io-2.4.jar:/opt/hadoop/share/hadoop/mapreduce/lib/jersey-server-1.9.jar:/opt/hadoop/share/hadoop/mapreduce/lib/hamcrest-core-1.3.jar:/opt/hadoop/share/hadoop/mapreduce/lib/paranamer-2.3.jar:/opt/hadoop/share/hadoop/mapreduce/lib/javax.inject-1.jar:/opt/hadoop/share/hadoop/mapreduce/lib/jersey-guice-1.9.jar:/opt/hadoop/share/hadoop/mapreduce/lib/guice-3.0.jar:/opt/hadoop/share/hadoop/mapreduce/lib/jackson-mapper-asl-1.9.13.jar:/opt/hadoop/share/hadoop/mapreduce/lib/junit-4.11.jar:/opt/hadoop/share/hadoop/mapreduce/lib/hadoop-annotations-2.9.0.jar:/opt/hadoop/share/hadoop/mapreduce/lib/commons-compress-1.4.1.jar:/opt/hadoop/share/hadoop/mapreduce/lib/asm-3.2.jar:/opt/hadoop/share/hadoop/mapreduce/lib/avro-1.7.7.jar:/opt/hadoop/share/hadoop/mapreduce/lib/netty-3.6.2.Final.jar:/opt/hadoop/share/hadoop/mapreduce/lib/snappy-java-1.0.5.jar:/opt/hadoop/share/hadoop/mapreduce/lib/aopalliance-1.0.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-2.9.0.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.9.0.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-app-2.9.0.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.9.0-tests.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-2.9.0.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.9.0.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.9.0.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.9.0.jar:/opt/hadoop/contrib/capacity-scheduler/*.jar:/opt/hadoop/contrib/capacity-scheduler/*.jar

STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 756ebc8394e473ac25feac05fa493f6d612e6c50; compiled by 'arsuresh' on 2017-11-13T23:15Z

STARTUP_MSG:   java = 1.8.0_162

************************************************************/
18/02/20 12:32:57 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
18/02/20 12:32:57 INFO namenode.NameNode: createNameNode [-format]
18/02/20 12:32:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/20 12:32:59 WARN common.Util: Path /opt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/02/20 12:32:59 WARN common.Util: Path /opt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
Formatting using clusterid: CID-12c8fbd0-1eaf-4d13-8e7b-8d8e37ceb252
18/02/20 12:32:59 INFO namenode.FSEditLog: Edit logging is async:true
18/02/20 12:32:59 INFO namenode.FSNamesystem: KeyProvider: null
18/02/20 12:32:59 INFO namenode.FSNamesystem: fsLock is fair: true
18/02/20 12:32:59 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
18/02/20 12:32:59 INFO namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)
18/02/20 12:32:59 INFO namenode.FSNamesystem: supergroup          = supergroup
18/02/20 12:32:59 INFO namenode.FSNamesystem: isPermissionEnabled = true
18/02/20 12:32:59 INFO namenode.FSNamesystem: HA Enabled: false
18/02/20 12:32:59 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
18/02/20 12:32:59 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit: configured=1000, counted=60, effected=1000
18/02/20 12:32:59 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
18/02/20 12:32:59 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
18/02/20 12:32:59 INFO blockmanagement.BlockManager: The block deletion will start around 2018 Feb 20 12:32:59
18/02/20 12:32:59 INFO util.GSet: Computing capacity for map BlocksMap
18/02/20 12:32:59 INFO util.GSet: VM type       = 64-bit
18/02/20 12:32:59 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
18/02/20 12:32:59 INFO util.GSet: capacity      = 2^21 = 2097152 entries
18/02/20 12:33:00 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
18/02/20 12:33:00 WARN conf.Configuration: No unit for dfs.namenode.safemode.extension(30000) assuming MILLISECONDS
18/02/20 12:33:00 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
18/02/20 12:33:00 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
18/02/20 12:33:00 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
18/02/20 12:33:00 INFO blockmanagement.BlockManager: defaultReplication         = 1
18/02/20 12:33:00 INFO blockmanagement.BlockManager: maxReplication             = 512
18/02/20 12:33:00 INFO blockmanagement.BlockManager: minReplication             = 1
18/02/20 12:33:00 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
18/02/20 12:33:00 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
18/02/20 12:33:00 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
18/02/20 12:33:00 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
18/02/20 12:33:00 INFO namenode.FSNamesystem: Append Enabled: true
18/02/20 12:33:00 INFO util.GSet: Computing capacity for map INodeMap
18/02/20 12:33:00 INFO util.GSet: VM type       = 64-bit
18/02/20 12:33:00 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
18/02/20 12:33:00 INFO util.GSet: capacity      = 2^20 = 1048576 entries
18/02/20 12:33:01 INFO namenode.FSDirectory: ACLs enabled? false
18/02/20 12:33:01 INFO namenode.FSDirectory: XAttrs enabled? true
18/02/20 12:33:01 INFO namenode.NameNode: Caching file names occurring more than 10 times
18/02/20 12:33:01 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: falseskipCaptureAccessTimeOnlyChange: false
18/02/20 12:33:01 INFO util.GSet: Computing capacity for map cachedBlocks
18/02/20 12:33:01 INFO util.GSet: VM type       = 64-bit
18/02/20 12:33:01 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
18/02/20 12:33:01 INFO util.GSet: capacity      = 2^18 = 262144 entries
18/02/20 12:33:01 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
18/02/20 12:33:01 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
18/02/20 12:33:01 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
18/02/20 12:33:01 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
18/02/20 12:33:01 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
18/02/20 12:33:01 INFO util.GSet: Computing capacity for map NameNodeRetryCache
18/02/20 12:33:01 INFO util.GSet: VM type       = 64-bit
18/02/20 12:33:01 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
18/02/20 12:33:01 INFO util.GSet: capacity      = 2^15 = 32768 entries
18/02/20 12:33:01 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1005057487-10.0.1.50-1519129981121
18/02/20 12:33:01 INFO common.Storage: Storage directory /opt/hadoop/dfs/name has been successfully formatted.
18/02/20 12:33:01 INFO namenode.FSImageFormatProtobuf: Saving image file /opt/hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
18/02/20 12:33:01 INFO namenode.FSImageFormatProtobuf: Image file /opt/hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 323 bytes saved in 0 seconds.
18/02/20 12:33:01 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/02/20 12:33:01 INFO namenode.NameNode: SHUTDOWN_MSG: 

/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at lxchadoopmaster/10.0.1.50
************************************************************/

Configurações especificas dos Slave Nodes:

# vi hdfs-site.xml
# Adicionar dentro da tag configuration
<property>
	<name>dfs.data.dir</name>
	<value>/opt/hadoop/dfs/name/data</value>
	<final>true</final>
</property>

Criar a diretoria onde o dfs irá armazenar dados e atribuir as permissões corretas:

# mkdir /opt/hadoop/dfs/name/data -p 
# chown hadoop.hadoop /opt/hadoop/dfs/name/data

Finalmente efetuar boot ao Hadoop.

# su - hadoop
$ start-dfs.sh
[hadoop@lxcHadoopMaster ~]$ start-dfs.sh

18/02/20 13:24:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Starting namenodes on [lxchadoopmaster]

lxchadoopmaster: starting namenode, logging to /opt/hadoop/logs/hadoop-hadoop-namenode-lxcHadoopMaster.out
lxchadoopslave01: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-lxchadoopslave01.out
lxchadoopslave02: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-lxchadoopslave02.out

Starting secondary namenodes [0.0.0.0]

0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-hadoop-secondarynamenode-lxcHadoopMaster.out

Vamos testar agora a nossa solução:

$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hadoop
$ hdfs dfs -put /opt/hadoop/hadoop-2.9.0.tar.gz /user/hadoop
# su - hadoop

$ hdfs dfsadmin -report 

Configured Capacity: 214299746304 (199.58 GB)
Present Capacity: 177115291648 (164.95 GB)
DFS Remaining: 176745652224 (164.61 GB)
DFS Used: 369639424 (352.52 MB)
DFS Used%: 0.21%

Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Pending deletion blocks: 0

-------------------------------------------------

Live datanodes (2):

Name: 10.0.1.51:50010 (lxchadoopslave01)

Hostname: lxchadoopslave01
Decommission Status : Normal
Configured Capacity: 107149873152 (99.79 GB)
DFS Used: 369631232 (352.51 MB)
Non DFS Used: 18407415808 (17.14 GB)
DFS Remaining: 88372826112 (82.30 GB)
DFS Used%: 0.34%
DFS Remaining%: 82.48%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1

Last contact: Tue Feb 20 13:43:24 UTC 2018
Last Block Report: Tue Feb 20 13:24:39 UTC 2018

Name: 10.0.1.52:50010 (lxchadoopslave02)

Hostname: lxchadoopslave02
Decommission Status : Normal
Configured Capacity: 107149873152 (99.79 GB)
DFS Used: 8192 (8 KB)
Non DFS Used: 18777038848 (17.49 GB)
DFS Remaining: 88372826112 (82.30 GB)
DFS Used%: 0.00%
DFS Remaining%: 82.48%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1

Last contact: Tue Feb 20 13:43:24 UTC 2018
Last Block Report: Tue Feb 20 13:24:39 UTC 2018

Graficamente também é possível visualizar a nossa solução, bastando apontar o nosso browser para o ip do master node, porta 50070:

http://<hadoopnode>:50070/explorer.html#/user/hadoop/

Temos agora o layer de storage que é o HDFS, e que poderá ser integrado com mais componentes de forma a garantir armazenamento, consulta e batch processing conforme o desejado.

Em termos de recursos, e tendo em conta que o load neste momento é mínimo, o comportamento dos containers que contem os ambientes de master e slave, tem o seguinte consumo de recursos ao nível do virtualizador:

# lxc-info --name lxcHadoopMaster

Name:           lxcHadoopMaster
State:          RUNNING
PID:            16796
IP:             10.0.1.50
CPU use:        7.74 seconds
BlkIO use:      71.84 MiB
Memory use:     245.66 MiB
KMem use:       0 bytes
Link:           vethYT8FS6
 TX bytes:      745.84 MiB
 RX bytes:      528.80 MiB

 Total bytes:   1.24 GiB
# lxc-info --name lxcHadoopSlave01

Name:           lxcHadoopSlave01
State:          RUNNING
PID:            10177
IP:             10.0.1.51
CPU use:        0.81 seconds
BlkIO use:      36.86 MiB
Memory use:     255.78 MiB
KMem use:       0 bytes
Link:           vethD16G04
 TX bytes:      11.53 MiB
 RX bytes:      1.07 GiB*
 Total bytes:   1.08 GiB
# lxc-info --name lxcHadoopSlave02

Name:           lxcHadoopSlave02
State:          RUNNING
PID:            10386
IP:             10.0.1.52
CPU use:        0.54 seconds
BlkIO use:      36.53 MiB
Memory use:     244.32 MiB
KMem use:       0 bytes
Link:           veth6GBUAV
 TX bytes:      8.74 MiB*
 RX bytes:      370.46 MiB
 Total bytes:   379.21 MiB

Chegamos ao fim de mais um post, desta vez, do layer de storage para Big Data!
Irei em breve fazer um post, onde será integrado o Elastic com o hdfs para fornecer analise analítica dos dados armazenados no nosso Lab de Hadoop.

Caso tenham alguma duvida sabem onde me podem encontrar, ou em alternativa, enviem-me um email para nuno[at]nuneshiggs.com

Até já!
Nuno

Referencias: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html