I've been interested in learning Hadoop for a while, but it wasn't obvious where to start, so I thought I would share the path I took in case someone else wants a starting point.
Learn Concepts
Working with big data is very different from working with smaller relational databases, so I knew I had to spend some time getting a conceptual understanding before I started hacking code together. I was about to get the book, but the reviews said it was already out of date. I found several references to Cloudera's site and figured I'd see what they had. They have some very helpful training videos that actually teach instead of sell, and I recommend watching them. Some thoughts I've had so far:
- If you're not working with very large datasets, Hadoop isn't worth using
- The way data flows through a Map/Combine/Reduce job reminds me of ETL
- It's important to think in a functional rather than an imperative programming style
- Debugging jobs will probably hurt my head
- Decomposing the work into very small, simple steps is a best practice
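To make the Map/Combine/Reduce flow a bit more concrete, here is a minimal sketch in plain Python. It does not use Hadoop at all; the function names (`map_phase`, `combine_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own hypothetical names, and the "splits" are just lists of lines standing in for the chunks of input a real job would hand to separate mappers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine_phase(pairs):
    """Combine: pre-sum counts locally, as if on the node that ran the mapper."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(pairs):
    """Shuffle: group values by key so each key is handled by one reduce call."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: collapse every partial count for one word into a total."""
    return word, sum(counts)


def word_count(splits):
    """Run map and combine per split, then shuffle and reduce across splits."""
    combined = [
        combine_phase(chain.from_iterable(map_phase(line) for line in split))
        for split in splits
    ]
    grouped = shuffle(chain.from_iterable(combined))
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Two hypothetical input splits, each a list of lines.
    splits = [["the quick brown fox", "the lazy dog"], ["the fox jumps"]]
    print(word_count(splits))
```

Each step is a small function with no shared state, which is roughly what the "think functional" and "small simple steps" points above are getting at. The combine step is optional here: dropping it gives the same totals, it just sends more pairs through the shuffle.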
Setup Environment
Cloudera also provides several virtual machines that have everything already configured. I found this a good place to start because I'm coming from a developer perspective and would like to skip the system administration aspect at this stage. Here are the steps I followed:
- Installed VMware Player
- Downloaded the Cloudera VM with Hadoop preinstalled
- Changed the VM's RAM setting from 1 GB to 3 GB
- Changed the VM settings to add a CD drive (so I could install VMware Tools)
- Started the VM
- Installed VMware Tools
- Opened a terminal and tested the installation with some example commands:
  - NOTE: Hadoop is installed in the /usr/lib/hadoop dir
  - NOTE: user/pass is cloudera/cloudera and this user has sudo rights
  - hadoop fs -mkdir /foo
  - hadoop fs -ls /
  - hadoop fs -rmr /foo
  - hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
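The last command runs the bundled pi example with 2 map tasks and 100000 samples per map. As a rough sketch of the idea (not the code in the examples jar), here is a plain Python version that splits a Monte Carlo estimate of pi into independent "map" tasks and a single "reduce" step. The function names are hypothetical, and the use of ordinary pseudo-random sampling is my assumption; the real example may pick its sample points differently.

```python
import random


def map_task(seed, n_samples):
    """One 'map': sample points in the unit square, count hits inside the quarter circle."""
    rng = random.Random(seed)  # separate seed per task so tasks are independent
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, n_samples


def reduce_task(results):
    """The 'reduce': sum the partial counts and turn the ratio into an estimate of pi."""
    inside = sum(hits for hits, _ in results)
    total = sum(count for _, count in results)
    return 4.0 * inside / total


def estimate_pi(n_maps, n_samples):
    """Run n_maps independent map tasks of n_samples each, then reduce."""
    return reduce_task([map_task(seed, n_samples) for seed in range(n_maps)])


if __name__ == "__main__":
    # Mirrors the arguments in the command above: 2 maps, 100000 samples each.
    print(estimate_pi(2, 100000))
```

The map tasks never talk to each other, so they could run on different machines; only the small (hits, samples) pairs need to travel to the reducer.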
Now I need to start writing code and see if I can get something besides the examples to work. I think I'll start with the word count tutorial. I may also see if there's something to play with in Matthew's GitHub project.