All Apache Hadoop distributions are built from the same central source code hosted by the Apache Software Foundation. Anssr Data Engine is no different: our distribution is built directly from the upstream source of the Apache Bigtop project, the official steward of Hadoop packaging. As such, Anssr Data Engine has exactly the same features as the other vendors' distributions, but with the best support, reduced costs and amazing flexibility.
Deploy Hadoop with a single command on all the leading clouds, scale at will and choose your level of support; you don't need to talk to a single salesperson (although you can certainly still do so if you wish).
Deploying Anssr Data Engine
There are a number of ways to deploy Hadoop, but the easiest is to pick one of the two bundles below and then tune it to your requirements. Anssr Data Engine is our reference model and contains all the components we distribute and support. The second bundle contains just Hadoop: no bells and whistles, just the important stuff.
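Whichever bundle you pick, deployment really is a single command with Juju. The bundle name below is illustrative rather than definitive; substitute the name of the bundle you chose:

juju deploy anssr-data-engine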
Getting data into Hadoop
Of course, a data platform isn't very useful without data, so how do you get data into a Hadoop cluster? You could stream it in via Kafka and Flume, import it via Hive, run a Sqoop job, or use the HDFS tools to place files directly into the cluster. To get started we will use the direct file transfer method and place some files into the cluster to test with.
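Each of those routes deserves its own guide, but to give a flavour of one alternative, a Sqoop job that pulls a relational table into HDFS looks something like this (the JDBC URL, username and table name here are placeholders):

sqoop import --connect jdbc:mysql://db.example.com/sales --username ubuntu -P --table orders --target-dir /user/ubuntu/input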
First, let's SSH into one of our servers and download some sample data:
juju ssh 0
wget -O sales.tar.bz2 "https://www.dropbox.com/s/wx9pzx7n4dpxp89/sales.tar.bz2?dl=1"
tar xvfj sales.tar.bz2
You should then see a new folder called sales.
The next job is to put these files onto the HDFS filesystem:
hadoop fs -mkdir /user/ubuntu/input
hadoop fs -copyFromLocal sales/* /user/ubuntu/input/
Once that's done you should be able to run:
hadoop fs -ls /user/ubuntu/input
and see the following output:
Found 3 items
-rw-r--r--   3 ubuntu hadoop    3718736 2018-06-07 14:35 /user/ubuntu/input/sales1.csv
-rw-r--r--   3 ubuntu hadoop    3560018 2018-06-07 14:35 /user/ubuntu/input/sales2.csv
-rw-r--r--   3 ubuntu hadoop    2292413 2018-06-07 14:35 /user/ubuntu/input/sales3.csv
Our files are now distributed across the HDFS cluster.
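If you want to see that distribution for yourself, hdfs fsck will report the blocks behind each file and where their replicas live across the datanodes:

hdfs fsck /user/ubuntu/input -files -blocks -locations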
Mapping and Reducing
MapReduce is the core concept of Hadoop and lets us test the platform without installing anything else. To test the HDFS cluster and its compute capability we will now compile a small word count program and run it on the cluster.
Compile the following program or download the jar from here.
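What follows is the classic Hadoop WordCount, written against the org.apache.hadoop.mapred API (whose reducers name their output part-00000, matching the listing further down), in the com.researchplatforms package referenced by the hadoop jar command below. Treat it as a minimal sketch rather than the exact source behind the prebuilt jar.

package com.researchplatforms;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every whitespace-separated token in the input
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sum the counts for each word
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Input and output paths come from the command line, exactly as
        // passed in the hadoop jar invocation below
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}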
Copy the jar to the server:
juju scp ./target/wordcount-1.0-SNAPSHOT.jar 0:~
Next, run the MapReduce job:
hadoop jar ./wordcount-1.0-SNAPSHOT.jar com.researchplatforms.WordCount /user/ubuntu/input /user/ubuntu/output
Once that's complete we should have a new folder with a file inside it:
hadoop fs -ls /user/ubuntu
hadoop fs -ls /user/ubuntu/output
hadoop fs -cat /user/ubuntu/output/part-00000
As you can see from the output, the job has chunked the sales files up and calculated word repetition across our test set of data.
Scaling Hadoop to suit your workload
Scaling Anssr to provide more compute resources is easy:
juju add-unit -n <x> slave
and the platform will start launching new servers that will automatically get added to your cluster.
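For example, to add two more compute nodes to the cluster:

juju add-unit -n 2 slave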
Connecting more software