Tuesday, September 27, 2011

Learning Hadoop Part 1

I've had an interest in learning Hadoop for a while, and it wasn't super obvious where to start, so I thought I would share the path I took in case someone else wants a starting point.

Learn Concepts
Working with big data is very different from working with smaller relational databases, so I knew I had to spend some time getting a conceptual understanding before I started hacking code together. I was about to get the book, but the reviews said it was already out of date. I found several references to Cloudera's site and figured I'd see what they have. They have some very helpful training videos that actually teach instead of sell, and I recommend watching them. Some thoughts I've had so far:
  • If you're not working with very large datasets then it's not worth using this
  • The way data flows through a Map/Combine/Reduce job reminds me of ETL 
  • It's important to think in functional rather than imperative programming terms.
  • Debugging jobs will probably hurt my head
  • Decomposing the work into very small simple steps is a best-practice
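
To make the Map/Combine/Reduce flow from the list above concrete, here is a minimal in-memory sketch of a word count in plain Java. It deliberately does not use the Hadoop API; the class name WordCountSketch and its methods are made up for illustration, and each "split" is just a list of lines. The point is only to show the shape of the data as it moves through three small, side-effect-free steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration only: this does not use the Hadoop API.
class WordCountSketch {

    // Map step: turn one line into its words (each word stands for a (word, 1) pair).
    static List<String> map(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Combine step: pre-aggregate the counts for a single input split.
    static Map<String, Integer> combine(List<String> linesInSplit) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String line : linesInSplit) {
            for (String word : map(line)) {
                partial.merge(word, 1, Integer::sum);
            }
        }
        return partial;
    }

    // Reduce step: merge the partial counts from every split into final totals.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> totals.merge(word, count, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> splitA = Arrays.asList("the quick brown fox", "the lazy dog");
        List<String> splitB = Arrays.asList("The dog barks");
        Map<String, Integer> totals = reduce(Arrays.asList(combine(splitA), combine(splitB)));
        System.out.println(totals);
    }
}
```

Each step takes input and returns a new value without touching shared state, which is roughly what thinking functionally means here. In a real job the framework, not a main method, would decide where each step runs.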

Setup Environment
Cloudera also provides several virtual machines with everything already configured. I found this a good place to start because I'm coming from a developer perspective and would like to skip the system administration aspect at this stage. Here are the steps I followed:
  1. Installed VMware Player
  2. Downloaded the Cloudera VM with Hadoop preinstalled
  3. Changed the VM's RAM setting from 1 GB to 3 GB
  4. Changed the VM settings to have a CD drive (so I can install VMware Tools)
  5. Started the VM
  6. Installed VMware Tools
  7. Opened a terminal and tested the installation with some example commands.
  8. NOTE: Hadoop is installed in the /usr/lib/hadoop dir
  9. NOTE: user/pass is cloudera/cloudera and this user has sudo rights
Example hadoop commands:
  • hadoop fs -mkdir /foo
  • hadoop fs -ls /
  • hadoop fs -rmr /foo
  • hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
Next Steps
Now I need to start writing code and see if I can make something besides the examples work. I think I'll start with the word count tutorial. I may also see if there's something to play with from Matthew's GitHub project.

Tuesday, April 19, 2011

Java EE 6 Security (without another framework)

While working on a new project I needed to add good old authentication and authorization for a RESTful web service. There are many popular frameworks out there for doing this, but with the mindset that less is more I wanted to see if I could do it without adding yet another framework. I've been very happy with Java EE 6 and how far it has come on simplicity, so I put it to the test on this requirement. Below is all I had to do to protect my resources.

1. Configure GlassFish to provide a security realm. I used a jdbcRealm that pointed to my database. I have two tips for this. First, if your db schema doesn't have the structure the container wants, simply create a view. Second, make sure your JAAS Context name is "jdbcRealm"; this is a "special" identifier and it doesn't work if you pick your own.

2. Configure the web.xml to specify what resources you want protected. It ended up looking like this:

        <display-name>user authentication</display-name>
            <web-resource-name>Requiring authorization resources</web-resource-name>
            <description>Have to be a USER</description>

3. Inject SessionContext where I needed to get Principal information for user-level authorization. It ended up being a simple method in my base resource class I call when I need to know which username was authenticated:

    // javax.ejb.SessionContext, injected by the container
    @Resource
    private SessionContext sessionContext;

    protected String getCallerUsername() {
        Principal principal = sessionContext.getCallerPrincipal();
        return principal.getName();
    }
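
As a rough illustration of what user-level authorization on top of that method might look like, here is a stand-alone sketch. It is not the code from the project: CallerSource is a hypothetical stand-in for the container's SessionContext so the example runs without an application server, and the "caller may only touch their own records" rule is an assumption made up for the example.

```java
import java.security.Principal;

// Stand-alone sketch: CallerSource is a hypothetical stand-in for the
// container-provided SessionContext, so this runs without an app server.
class OwnershipCheckSketch {

    interface CallerSource {
        Principal getCallerPrincipal();
    }

    private final CallerSource callerSource;

    OwnershipCheckSketch(CallerSource callerSource) {
        this.callerSource = callerSource;
    }

    // Same idea as the base resource method: ask for the authenticated principal.
    String getCallerUsername() {
        return callerSource.getCallerPrincipal().getName();
    }

    // Assumed rule for illustration: a caller may only access records they own.
    boolean callerOwns(String ownerUsername) {
        return getCallerUsername().equals(ownerUsername);
    }

    public static void main(String[] args) {
        final Principal alice = new Principal() {
            @Override
            public String getName() {
                return "alice";
            }
        };
        OwnershipCheckSketch check = new OwnershipCheckSketch(new CallerSource() {
            @Override
            public Principal getCallerPrincipal() {
                return alice;
            }
        });
        System.out.println("alice owns alice's record: " + check.callerOwns("alice"));
        System.out.println("alice owns bob's record: " + check.callerOwns("bob"));
    }
}
```

The container still does the authentication and the role check from web.xml; a check like this only covers the finer-grained "is this the right user" question.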

I was very happy to be able to implement this common use case with Java EE 6 and not have to pull in yet another framework!

Sunday, September 19, 2010

Why we do conferences

So I'm off attending a JavaOne conference for the 3rd time. Some people may wonder why anyone goes to conferences anymore when there is so much available online and it costs lots of money. There are 2 main reasons that I have for spending the time and money to come in person.

1. Dedicated time - Yes, there are tons of resources online, but with the constant pressure to get stuff done it's very easy to neglect training time. I try hard to constantly keep my skills sharp, but even so I often feel guilty when I watch a 1 hour webcast of something. Not so at a conference: I get to go from 1 hour session to 1 hour session non-stop for a week. It's a total immersion into getting up to speed and learning what's new. It seems like the difference between learning Spanish in a high-school classroom versus going to Mexico for a while.

2. Meeting people - One of the great things about coming to a conference is the people. The estimated attendance is 35K-40K people. They are even closing down certain traffic in downtown SF for this week. Last night after I got settled into my hotel I set out to find dinner - alone. Yes, it's sad, but I usually see SF by myself in a crowd of people. As I rode the cable car down to the wharf I met a guy who was also there for the conference. We struck up a conversation and decided to have dinner together. It turns out he is a CIO for a staffing company and we had a very pleasant conversation over dinner. We talked about the economy, the industry, technologies, travels, etc. There were no ulterior motives: I wasn't selling Amway, he wasn't trying to recruit me (his staffing company is non-technical), we just enjoyed meeting someone new and were glad not to have to eat dinner alone. I make a special effort during these conferences to meet and talk with people to find out what others are doing and maybe learn something from them.

There is so much more to conferences than just learning new technologies. In the past it has always been a real boost to my skill set and I expect the same this time. I'll try to share some of my thoughts on this blog during the conference.

Tuesday, December 29, 2009

Misleading Exception

I came across an error today that I know I've solved at least twice. I'm writing this post in case I or anyone else runs into it again, because the error is very misleading.

The technology I'm using is Spring and JPA. The error I got when trying to access the database for the first time was:

org.springframework.transaction.CannotCreateTransactionException: Could not open JPA EntityManager for transaction; nested exception is java.lang.UnsupportedOperationException: Not supported by BasicDataSource

The solution was to take out all the properties in the persistence.xml file. It seems that when the connection properties are set in both Spring and persistence.xml, it's not happy.

So my persistence.xml looks like:
<persistence-unit name="myPersistentUnit" transaction-type="RESOURCE_LOCAL"/>

And my applicationContext.xml looks like this:
    <!-- The class attribute was missing from the original listing. LocalContainerEntityManagerFactoryBean
         is assumed here because it accepts both the dataSource and jpaVendorAdapter properties. -->
    <bean id="entityManagerFactory"
          class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
        <property name="dataSource" ref="dataSource"/>
        <property name="jpaVendorAdapter">
            <bean class="org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter">
                <property name="showSql" value="${hibernate.show.sql}"/>
                <property name="generateDdl" value="${hibernate.generateDdl}"/>
                <property name="databasePlatform" value="${hibernate.dialect}"/>
            </bean>
        </property>
    </bean>

    <bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource">
        <property name="driverClassName" value="${database.driver}"/>
        <property name="url" value="${database.url}"/>
        <property name="username" value="${database.username}"/>
        <property name="password" value="${database.password}"/>
        <property name="initialSize" value="5"/>
        <property name="maxActive" value="20"/>
        <property name="maxIdle" value="5"/>
        <property name="defaultTransactionIsolation" value="2"/>
        <!-- <property name="defaultTransactionIsolation" value="1" /> -->
        <property name="validationQuery" value="select 1 from dual"/>
        <property name="testOnBorrow" value="true"/>
    </bean>

    <bean id="transactionManager" class="org.springframework.orm.jpa.JpaTransactionManager">
        <property name="entityManagerFactory" ref="entityManagerFactory"/>
    </bean>

Thursday, November 12, 2009

Agile Presentation

So there was a reorganization at work recently and several people that have been on different small teams are now all kinda combined on the same team. I think this was a good move and I thought it might be helpful to improve communication by having regular brown bags. Of course I knew by suggesting this that I would be the first volunteer. In our team there is a wide spectrum of understanding of what Agile is, so I created a presentation that started at the top with what I think Agile is all about. It's titled Agile Values, Principles and Practices.

I've posted it on my website for anyone to "reuse" at http://www.jackcrews.net/downloads/agilevpp.pptx

Saturday, October 31, 2009

Why We Do What We Do

I have recently been reading a great book titled Effective Java by Joshua Bloch. I'm only part way through it but it has really clarified a number of things for me. I have been writing Java code for over a decade now and there are many things that I do just because it feels right but if I had to articulate why, I would have to think real hard about it. In this book Josh does an excellent job explaining the why behind good programming practice and style.

To give an example, I have never used the clone method that hangs on Object. I always used a copy constructor in favor of clone. I'm not sure why but probably because that's how most people in the Java community do it and it works well. Well "item 11 - Override clone judiciously" goes into great detail about the problems and complexities of using clone and why using a copy constructor is usually (but not always) a better way to go.
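
For anyone who hasn't seen the idiom, here is a minimal sketch of a copy constructor. The Playlist class is made up for illustration and is not an example from the book.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class, used only to illustrate the copy-constructor idiom.
final class Playlist {
    private final String name;
    private final List<String> tracks;

    Playlist(String name, List<String> tracks) {
        this.name = name;
        // Defensive copy so later changes to the caller's list don't leak in.
        this.tracks = new ArrayList<>(tracks);
    }

    // Copy constructor: builds an independent Playlist from an existing one.
    Playlist(Playlist other) {
        this(other.name, other.tracks);
    }

    void addTrack(String track) {
        tracks.add(track);
    }

    int trackCount() {
        return tracks.size();
    }

    public static void main(String[] args) {
        Playlist original = new Playlist("road trip", Arrays.asList("track one", "track two"));
        Playlist copy = new Playlist(original);
        copy.addTrack("track three");
        // The original is unaffected because the copy owns its own list.
        System.out.println(original.trackCount() + " vs " + copy.trackCount());
    }
}
```

Compared with clone, there is no Cloneable to implement, no cast, and no CloneNotSupportedException to handle, and the fields can stay final.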

Going through this book is giving me a better understanding of why we do what we do!

Sunday, June 14, 2009

Why do smart people write bad code?

I've had to look at a variety of code in the last few months written by a lot of different people (no, I'm not thinking of one particular person) and I've been a little puzzled by this question - Why do smart people write bad code? Now I'm not talking about someone who normally writes good code and has to rush something in, or in a lapse of judgment writes questionable code. I am guilty of writing stuff that I come back to later and say "what was this person thinking?" only to find out it was me. I'm talking about people who seem to be very smart when you talk with them about various things, but when you see their code time after time you ask yourself, "did they REALLY write that?"

To be a little more specific on what I'm talking about I'll give a few examples.
  • In a basic Object Oriented (OO) class the first thing you learn is the difference between an "is-a" and a "has-a" relationship and when to use which. Granted, this is not always an easy decision to make, but I recently got burned by some code that was returning events, and there were Started, Answered, Hangup, etc. events. Someone decided that since they had similar attributes, a Hangup event "is-a" Answered event and an Answered event "is-a" Started event. This inheritance is clearly not true, and it created a bug that was difficult to find.
  • Another example that recently caused an "emergency patch for production" (which is another blog in itself) was related to a database design. When I first saw this code I almost fell out of my chair laughing. (Sorry, I shouldn't be so rude in my blog but I really did almost fall out of my chair.) They seemed to be using a flavor of the strategy pattern where there were various classes to handle different cases. This was not a problem; I've done this myself. It was a little strange in that the way they determined which strategy to use was to put the name of the Spring bean in the database. One problem with that is that if you ever wanted to rename one of the strategy classes you had to change the database content too. Again, this isn't how I would have done it, but it's not horrible. The part that surprised me was that the column with the key to look up that bean name violated 1st normal form. For those that don't remember from the database 101 class, 1st normal form says that columns only contain atomic values. They violated that by making the key a comma-separated value list. I can think of no advantage of doing this. It can't be faster, and you can't index the key. It turns out in production that the key can have thousands of values all carefully comma separated in one column. Yuck! I know the people who wrote this code and they are smart.
To be clear, all of them are really nice guys that I enjoy working with. I could go on with many, many more examples but you get a feel for what I'm talking about. So I come back to my question about the guys that have years of experience or have advanced degrees or both. Here are some of my ideas and I'd like to hear yours.
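
Going back to the event example in the first bullet, here is a sketch of one way such a hierarchy can bite and how sibling types avoid it. The class names mirror the events mentioned above, but the code is hypothetical; I'm not reproducing the real code or the actual bug.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical event types named after the ones in the post; not the original code.
class CallEventSketch {

    // Shared attributes live in one base type, so the concrete events stay siblings.
    abstract static class CallEvent {
        private final String callId;

        CallEvent(String callId) {
            this.callId = callId;
        }

        String callId() {
            return callId;
        }
    }

    static final class StartedEvent extends CallEvent {
        StartedEvent(String callId) {
            super(callId);
        }
    }

    static final class AnsweredEvent extends CallEvent {
        AnsweredEvent(String callId) {
            super(callId);
        }
    }

    static final class HangupEvent extends CallEvent {
        HangupEvent(String callId) {
            super(callId);
        }
    }

    // Counts answered calls by type. If HangupEvent extended AnsweredEvent,
    // every hangup would also pass this instanceof check and be counted.
    static int countAnswered(List<CallEvent> events) {
        int answered = 0;
        for (CallEvent event : events) {
            if (event instanceof AnsweredEvent) {
                answered++;
            }
        }
        return answered;
    }

    public static void main(String[] args) {
        List<CallEvent> events = Arrays.asList(
                new StartedEvent("call-1"),
                new AnsweredEvent("call-1"),
                new HangupEvent("call-1"));
        System.out.println("Answered events: " + countAnswered(events));
    }
}
```

The similar attributes are a "has-a" concern (every event has a call id), which a common base class or a shared field covers without claiming that a hangup is a kind of answer.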
  1. There is constant pressure to deliver more faster and we do things with good intentions of coming back to fix them later but never do.
  2. People don't write code in a Test Driven Development (TDD) approach which leads to messy over-engineered code.
  3. People get so dependent on frameworks to tell us how to do something that we don't remember basic OO principles.
  4. People use a framework that they don't understand and end up making it do really stupid things (then of course at that point it becomes a really stupid framework, just ask them, they'll tell you!)
  5. Arrogance causes "smart" people to write complicated code because it's cool or clever.
  6. Arrogance that they know what functionality will be needed in the future and they might as well just build it now instead of later. (pssst, later never comes!)
  7. People read about or talk to someone about how to do something on a different project, with different circumstances, and they blindly follow the same approach without really understanding the consequences of it.
  8. People aren't pairing or having enough design discussions with others when making important decisions.
  9. People have been working in a particular type of code (e.g. building phone switches) and think that all code should be built the same way (e.g. performance is priority #1).
Can you think of one more so I can have a "10 reasons why smart people write bad code" list?

p.s. I have no intention of offending any co-workers in the making of this blog, I'm sure I'll find out tomorrow the perfectly good reasons for those design decisions :)