Ansible playbook that installs and configures a Hadoop cluster.
Latest commit b821ee815b by Andrea Dell'Amico: "The old playbook files. Some are missing."

Repository contents: files, group_vars, inventory, roles, templates, .gitignore, ChangeLog, LICENSE, README.md, README.new, hadoop-common.yml, nagios.yml, zookeeper.yml

README.md

Hadoop cluster based on the CDH 4 packages.

This is the playbook I used to install and configure the Hadoop cluster @CNR, based on the deb packages found in the Cloudera repositories. Cloudera Manager was neither used nor installed.

The cluster

The cluster structure is the following:

  • jobtracker.t.hadoop.research-infrastructures.eu (3.5GB RAM, 4 CPUs):

    • mapreduce HA jobtracker
    • zookeeper quorum
    • HA HDFS journal
  • quorum4.t.hadoop.research-infrastructures.eu (3.5GB RAM, 4 CPUs):

    • mapreduce HA jobtracker
    • zookeeper quorum
    • HA HDFS journal
  • nn1.t.hadoop.research-infrastructures.eu (3.5GB RAM, 4 CPUs):

    • hdfs HA namenode
    • zookeeper quorum
    • HA HDFS journal
  • nn2.t.hadoop.research-infrastructures.eu (3.5GB RAM, 4 CPUs):

    • hdfs HA namenode
    • zookeeper quorum
    • HA HDFS journal
  • hbase-master.t.hadoop.research-infrastructures.eu (3.5GB RAM, 4 CPUs):

    • hbase primary master
    • hbase thrift
    • zookeeper quorum
    • HA HDFS journal
  • hbase-master2.t.hadoop.research-infrastructures.eu (2GB RAM, 2 CPUs):

    • hbase secondary master
    • hbase thrift
  • node{2..13}.t.hadoop.research-infrastructures.eu (9GB RAM, 8 CPUs, 1000GB external storage for HDFS each):

    • mapreduce tasktracker
    • hdfs datanode
    • hbase regionserver
    • solr (sharded)
  • hive.t.hadoop.research-infrastructures.eu:

    • hue
    • hive
    • oozie
    • sqoop
  • db.t.hadoop.research-infrastructures.eu:

    • postgresql instance for hue and hive
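The five hosts marked "zookeeper quorum" above can be turned into the host:port,host:port connect string that ZooKeeper clients take (for example ha.zookeeper.quorum for HDFS automatic failover). A minimal sketch, assuming ZooKeeper listens on its default client port 2181; the actual port is not stated in these notes:

```shell
#!/bin/sh
# Build a ZooKeeper connect string from the hosts marked "zookeeper quorum"
# in the cluster layout. ZK_PORT=2181 is ZooKeeper's default client port
# and is an assumption here.
DOMAIN="t.hadoop.research-infrastructures.eu"
ZK_PORT=2181
ZK_HOSTS="jobtracker quorum4 nn1 nn2 hbase-master"

zk_quorum() {
    out=""
    for h in $ZK_HOSTS; do
        # append "host.domain:port", separating entries with commas
        out="${out:+$out,}$h.$DOMAIN:$ZK_PORT"
    done
    printf '%s\n' "$out"
}
```

With five servers the ensemble stays available as long as a majority (three) is up, so it tolerates two failures.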

The scripts that manage all the services are installed on jobtracker.t.hadoop.research-infrastructures.eu. You can stop/start individual services or the whole cluster, respecting the correct order.

They all have the "service-" prefix, and the script name gives an idea of the operations it performs:

  • service-global-hadoop-cluster
  • service-global-hbase
  • service-global-hdfs
  • service-global-mapred
  • service-global-zookeeper
  • service-hbase-master
  • service-hbase-regionserver
  • service-hbase-rest
  • service-hdfs-datanode
  • service-hdfs-httpfs
  • service-hdfs-journalnode
  • service-hdfs-namenode
  • service-hdfs-secondarynamenode
  • service-mapreduce-jobtracker
  • service-mapreduce-tasktracker
  • service-zookeeper-server

They take one parameter: start, stop, status or restart.
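A small wrapper can reject a bad action before calling one of the service-* scripts. A minimal sketch; run_service and the RUN dry-run hook (RUN=echo only prints the command) are illustrative and not part of the installed scripts:

```shell
#!/bin/sh
# Call one of the service-* scripts after checking the action.
# RUN is a hypothetical dry-run hook: with RUN=echo the command is printed.
RUN="${RUN:-}"

run_service() {
    svc="$1"
    action="$2"
    case "$action" in
        start|stop|status|restart) ;;
        *)
            echo "usage: run_service <name> start|stop|status|restart" >&2
            return 2
            ;;
    esac
    # run_service hdfs-namenode status -> service-hdfs-namenode status
    $RUN "service-$svc" "$action"
}
```

For example, run_service global-hbase restart calls service-global-hbase restart; with RUN set to echo it only prints that command.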


dom0/nodes/SAN map data

  • dlib18x: node8 e90.6 (dlibsan9)
  • dlib19x: node9 e90.7 (dlibsan9)
  • dlib20x: node10 e90.8 (dlibsan9)
  • dlib22x: node11 e90.5 (dlibsan9), node7 e63.4 (dlibsan6)
  • dlib23x: node12 e80.3 (dlibsan8), node13 e80.4 (dlibsan8)
  • dlib24x: node2 e25.1 (dlibsan2), node3 e74.1 (dlibsan7)
  • dlib25x: node4 e83.4 (dlibsan8)
  • dlib26x: node5 e72.1 (dlibsan7), node6 e63.3 (dlibsan6)


Submitting a job (supporting multiple users)

To support multiple users, create the UNIX user accounts only on the master node.

On the namenode:

# groupadd supergroup (run only once)

# adduser claudio ...

su - hdfs

$ hadoop dfs -mkdir /home/claudio
$ hadoop dfs -chown -R claudio:supergroup /home/claudio

(add claudio to the supergroup group)
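The same steps for one user, as a sketch to run as root on the namenode. It creates supergroup and the local account only when they are missing. The function name and the RUN dry-run hook are illustrative, and usermod -a -G is an assumed way of adding the user to supergroup:

```shell
#!/bin/sh
# Onboard one user as described above; run as root on the namenode.
# RUN is an illustrative dry-run hook (RUN=echo prints the commands).
RUN="${RUN:-}"

add_hadoop_user() {
    user="$1"
    # supergroup only needs to be created once
    getent group supergroup >/dev/null 2>&1 || $RUN groupadd supergroup
    # create the local account if it does not exist yet
    id "$user" >/dev/null 2>&1 || $RUN adduser "$user"
    # add the user to supergroup (usermod -a -G is an assumed choice)
    $RUN usermod -a -G supergroup "$user"
    # HDFS home directory, created and handed over by the hdfs superuser
    $RUN sudo -u hdfs hadoop dfs -mkdir "/home/$user"
    $RUN sudo -u hdfs hadoop dfs -chown -R "$user:supergroup" "/home/$user"
}
```

Usage: add_hadoop_user claudio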

Important:

If you do not create /tmp properly, with the right permissions as shown below, you may have problems with CDH components later. Specifically, if you don't create /tmp yourself, another process may create it automatically with restrictive permissions that will prevent your other applications from using it.

Create the /tmp directory after HDFS is up and running, and set its permissions to 1777 (drwxrwxrwt), as follows:

$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

Note:

If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command>

Step 8: Create MapReduce /var directories

$ sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

Step 9: Verify the HDFS File Structure

$ sudo -u hdfs hadoop fs -ls -R /

You should see:

drwxrwxrwt - hdfs supergroup 0 2012-04-19 15:14 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x - mapred supergroup 0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x - mapred supergroup 0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt - mapred supergroup 0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
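Comparing that listing by eye is easy to get wrong. A sketch that reads the listing on stdin and checks the two sticky directories; it assumes the eight-column layout shown above (permissions, replication, owner, group, size, date, time, path) and paths without spaces, and check_hdfs_layout is an illustrative name:

```shell
#!/bin/sh
# Check "hadoop fs -ls -R /" output (stdin): /tmp and the mapred staging
# directory must be drwxrwxrwt, and staging must be owned by mapred.
STAGING=/var/lib/hadoop-hdfs/cache/mapred/mapred/staging

check_hdfs_layout() {
    awk -v staging="$STAGING" '
        $8 == "/tmp"  { tmp_ok = ($1 == "drwxrwxrwt") }
        $8 == staging { stg_ok = ($1 == "drwxrwxrwt" && $3 == "mapred") }
        END {
            if (!tmp_ok) print "/tmp: wrong or missing (want drwxrwxrwt)"
            if (!stg_ok) print staging ": wrong or missing (want drwxrwxrwt, owner mapred)"
            # exit status 0 only when both directories look right
            if (tmp_ok && stg_ok) exit 0
            exit 1
        }'
}
```

Usage: sudo -u hdfs hadoop fs -ls -R / | check_hdfs_layout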

Step 10: Create and Configure the mapred.system.dir Directory in HDFS

After you start HDFS and create /tmp, but before you start the JobTracker (see the next step), you must also create the HDFS directory specified by the mapred.system.dir parameter (by default ${hadoop.tmp.dir}/mapred/system) and configure it to be owned by the mapred user.

To create the directory in its default location:

$ sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system

Important:

If you create the mapred.system.dir directory in a different location, specify that path in the conf/mapred-site.xml file.

When starting up, MapReduce sets the permissions for the mapred.system.dir directory to drwx------, assuming the user mapred owns that directory.

Step 11: Start MapReduce

To start MapReduce, start the TaskTracker and JobTracker services:

On each TaskTracker system:

$ sudo service hadoop-0.20-mapreduce-tasktracker start

On the JobTracker system:

$ sudo service hadoop-0.20-mapreduce-jobtracker start
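Both steps can be driven from one host over ssh. A sketch, assuming passwordless ssh and sudo on every target; the host names are arguments, and since this cluster runs an HA JobTracker, its service name may differ from the one used in Step 11:

```shell
#!/bin/sh
# Start MapReduce as in Step 11: all TaskTrackers first, then the JobTracker.
# Usage: start_mapreduce <jobtracker-host> <tasktracker-host>...
# RUN is an illustrative dry-run hook (RUN=echo prints the commands).
RUN="${RUN:-}"

start_mapreduce() {
    jt="$1"
    shift
    for tt in "$@"; do
        # stop at the first TaskTracker that fails to start
        $RUN ssh "$tt" sudo service hadoop-0.20-mapreduce-tasktracker start || return 1
    done
    $RUN ssh "$jt" sudo service hadoop-0.20-mapreduce-jobtracker start
}
```

In bash, the worker list can come from brace expansion: start_mapreduce jobtracker.t.hadoop.research-infrastructures.eu node{2..13}.t.hadoop.research-infrastructures.eu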

Step 12: Create a Home Directory for each MapReduce User

Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:

$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>

where <user> is the Linux username of each user.

Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

$ sudo -u hdfs hadoop fs -mkdir /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER
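For several users at once, a loop on the NameNode is enough. A sketch; create_user_homes is an illustrative name, the user names are passed as arguments, and RUN is an illustrative dry-run hook (RUN=echo prints the commands):

```shell
#!/bin/sh
# Create /user/<name> for every user name given, owned by that user.
RUN="${RUN:-}"

create_user_homes() {
    for u in "$@"; do
        # stop at the first failure so a half-created user is easy to spot
        $RUN sudo -u hdfs hadoop fs -mkdir "/user/$u" || return 1
        $RUN sudo -u hdfs hadoop fs -chown "$u" "/user/$u" || return 1
    done
}
```

Usage: create_user_homes claudio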


We use the jobtracker as the provisioning server.

Correct start order (reverse it to obtain the stop order):

  • HDFS (NB: substitute secondarynamenode with journalnode once we have HA)
  • MapReduce
  • Zookeeper
  • HBase
  • Hive Metastore
  • Hue
  • Oozie
  • Ganglia
  • Nagios
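The stop order can be derived mechanically from the start order, so only one list has to be maintained. A sketch; the component labels are illustrative and are not the names of the init scripts:

```shell
#!/bin/sh
# Start order as listed in these notes (illustrative labels).
START_ORDER="hdfs mapreduce zookeeper hbase hive-metastore hue oozie ganglia nagios"

start_order() {
    # unquoted on purpose: word splitting prints one component per line
    printf '%s\n' $START_ORDER
}

stop_order() {
    # the stop order is the start order reversed (awk instead of non-POSIX tac)
    start_order | awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }'
}
```

Usage: for c in $(stop_order); do echo "stopping $c"; done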

The init commands are in the file "init.sh" in the ansible repository.

Error to investigate: http://stackoverflow.com/questions/6153560/hbase-client-connectionloss-for-hbase-error

GC hints

http://stackoverflow.com/questions/9792590/gc-tuning-preventing-a-full-gc?rq=1

HBase troubleshooting

  • If some regions stay in "transition" indefinitely, you can try to fix the problem from the shell:

su - hbase

$ hbase hbck -fixAssignments

The following may also be useful:

$ hbase hbck -repairHoles


When "ROOT stuck in assigning forever" occurs, you need to:

  • check that there are no zookeeper-related errors. If there are, restart zookeeper and then the whole hbase cluster
  • restart only the hbase master

When there are disabled tables that cannot be enabled or dropped:

su - hbase

$ hbase hbck -fixAssignments

  • Restart the hbase master

See: http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/32838
And, in general, to understand how it works: http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/

A tool for monitoring hbase when it is configured for manual splitting: https://github.com/sentric/hannibal
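Before forcing a repair, it can help to let hbck report first and repair only when it finds a problem. A sketch, assuming the hbck report contains a "Status: OK" line when the cluster is consistent; hbck_is_ok and hbck_fix_if_needed are illustrative names, and RUN=echo only prints the repair command:

```shell
#!/bin/sh
# Run "hbase hbck -fixAssignments" only when a plain hbck run is not OK.
RUN="${RUN:-}"

hbck_is_ok() {
    # reads hbck output on stdin; succeeds when it reports "Status: OK"
    grep -q 'Status: OK'
}

hbck_fix_if_needed() {
    if sudo -u hbase hbase hbck 2>&1 | hbck_is_ok; then
        echo "hbck: Status OK, nothing to fix"
    else
        $RUN sudo -u hbase hbase hbck -fixAssignments
    fi
}
```

Restarting the hbase master, where the notes call for it, stays a separate manual step.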


2013-02-22 10:24:46,492 INFO org.apache.hadoop.mapred.TaskTracker: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@41a7fead
2013-02-22 10:24:46,492 INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all reduce tasks on tracker_node2.t.hadoop.research-infrastructures.eu:localhost/127.0.0.1:47798
2013-02-22 10:24:46,492 WARN org.apache.hadoop.mapred.TaskTracker: TaskTracker's totalMemoryAllottedForTasks is -1 and reserved physical memory is not configured. TaskMemoryManager is disabled.
2013-02-22 10:24:46,571 INFO org.apache.hadoop.mapred.IndexCache: IndexCache created with max memory = 10485760


An interesting post covering the configuration and the various parameters: http://gbif.blogspot.it/2011/01/setting-up-hadoop-cluster-part-1-manual.html

List of deprecated parameter names and their new names: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html


How to decommission a worker node

  1. If you are decommissioning many nodes, reduce the HDFS replication factor
  2. Stop the regionserver on the node
  3. Add the node to the HDFS and jobtracker exclude lists:

./run.sh mapred.yml -i inventory/hosts.production -l jt_masters --tags=hadoop_workers
./run.sh hadoop-hdfs.yml -i inventory/hosts.production -l hdfs_masters --tags=hadoop_workers

  4. Refresh the HDFS and jobtracker configuration:

hdfs dfsadmin -refreshNodes
mapred mradmin -refreshNodes

  5. Remove the node from the list of allowed ones:

5a. Edit the inventory

5b. Run:

./run.sh hadoop-common.yml -i inventory/hosts.production --tags=hadoop_workers
./run.sh mapred.yml -i inventory/hosts.production -l jt_masters --tags=hadoop_workers
./run.sh hadoop-hdfs.yml -i inventory/hosts.production -l hdfs_masters --tags=hadoop_workers
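The exclude-list and refresh commands, and the final playbook runs after editing the inventory, can be wrapped in two functions. A sketch to run from the ansible checkout that contains run.sh; the refresh commands are assumed to run with HDFS/MapReduce admin rights on a host that has the cluster configuration. Reducing replication, stopping the regionserver and editing the inventory stay manual. The function names and the RUN dry-run hook (RUN=echo) are illustrative:

```shell
#!/bin/sh
# Decommissioning helpers wrapping the commands listed above.
RUN="${RUN:-}"
INV=inventory/hosts.production

exclude_workers() {
    # push the updated exclude lists to the jobtracker and HDFS masters
    $RUN ./run.sh mapred.yml -i "$INV" -l jt_masters --tags=hadoop_workers || return 1
    $RUN ./run.sh hadoop-hdfs.yml -i "$INV" -l hdfs_masters --tags=hadoop_workers || return 1
    # make the namenode and the jobtracker re-read their host lists
    $RUN hdfs dfsadmin -refreshNodes || return 1
    $RUN mapred mradmin -refreshNodes
}

remove_workers() {
    # after editing the inventory, regenerate the allowed-host lists
    $RUN ./run.sh hadoop-common.yml -i "$INV" --tags=hadoop_workers || return 1
    $RUN ./run.sh mapred.yml -i "$INV" -l jt_masters --tags=hadoop_workers || return 1
    $RUN ./run.sh hadoop-hdfs.yml -i "$INV" -l hdfs_masters --tags=hadoop_workers
}
```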


Nagios monitoring

  • The handlers that restart the services are managed via nrpe. To make them work, we need to:
    • Add an entry in nrpe.cfg. The command name must start with "global_restart_" and the rest of the name must match the name of the service. For example: command[global_restart_hadoop-0.20-mapreduce-tasktracker]=/usr/bin/sudo /usr/sbin/service hadoop-0.20-mapreduce-tasktracker restart
    • Add a handler to the nagios service. The command takes the service name as a parameter. Example: event_handler restart-service!hadoop-0.20-mapreduce-tasktracker
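Since both lines are built from the same service name, generating them avoids mismatches between nrpe.cfg and the Nagios service definition. A sketch; nagios_restart_config is an illustrative name, and the nrpe user must also be allowed to run the service command via sudo without a password:

```shell
#!/bin/sh
# Print the nrpe.cfg entry and the Nagios event_handler line for a service.
nagios_restart_config() {
    svc="$1"
    # nrpe.cfg entry on the monitored host
    printf 'command[global_restart_%s]=/usr/bin/sudo /usr/sbin/service %s restart\n' "$svc" "$svc"
    # event handler line for the matching Nagios service definition
    printf 'event_handler restart-service!%s\n' "$svc"
}
```

Usage: nagios_restart_config hadoop-0.20-mapreduce-tasktracker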