Data Ingestion With Druid and Hadoop

Introduction

Welcome to this guide on ingesting data into your Druid and Hadoop bundle. It walks you through setting up your Anssr Druid Hadoop model, placing your data in the correct location within HDFS, and submitting a data ingestion job to Druid. We assume you have already spun up an Anssr Druid Hadoop bundle; if you have not, refer to our Deploying an Anssr Druid Hadoop Bundle guide.

Data Used For This Guide

For the purposes of this guide, we will be using crime data from the city of Chicago covering 2001-2018. You can find this data on the US Government's Data Catalog here. Note that if you load this data into Druid as-is, Druid will throw errors because the first line of the CSV file contains column headers, one of which is "DATE". Before using this data, make sure you have removed the first row of the file, leaving just the raw data; the column headers will be supplied in the ingestion task we submit to Druid later in the guide.
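
For example, once you have downloaded the file (the curl command is shown later in this guide), one way to strip the header row in place is with GNU sed:

# Delete the first line (the column headers) from the CSV in place
sed -i '1d' Crimes_-_2001_to_present.csv
# Confirm the first line is now a data row rather than headers
head -n 1 Crimes_-_2001_to_present.csv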

Placing Your Data Into HDFS

Once your Anssr Druid Hadoop model has successfully been deployed, Druid will start to use HDFS as its data store. Therefore, we need to place our data within HDFS so that Druid can access it.

First, we need to find out which machine the Hadoop Resource Manager is located on. To do this, we can either use the JAAS GUI or run juju status from the command line.
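
For example, from the command line you can narrow the status output to just the Resource Manager, assuming the application is named resourcemanager in your deployment:

# Show only the Resource Manager application, its unit, and the machine it runs on
juju status resourcemanager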

Once you have found the server number, you can ssh into this server using the following command:

juju ssh <server-number>

On this server, gain root privileges by running sudo -i. Once this is done, you can interact with the HDFS filesystem using the hdfs dfs command. The Druid Config charm creates a directory at /user/root/quickstart, so it is convenient to place your raw data there. You can use curl to download the Chicago crime data from above onto the Resource Manager server, and then copy it into HDFS:

curl https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD > Crimes_-_2001_to_present.csv
hdfs dfs -put Crimes_-_2001_to_present.csv /user/root/quickstart
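
If the put command fails because the quickstart directory does not exist yet (for example, the Druid Config charm has not created it), you can create the directory yourself and retry the copy:

# Create the target directory in HDFS if it does not already exist
hdfs dfs -mkdir -p /user/root/quickstart
# Retry the copy into HDFS
hdfs dfs -put Crimes_-_2001_to_present.csv /user/root/quickstart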

You can verify that the file is in the right place by listing the contents of /user/root/quickstart in HDFS:

hdfs dfs -ls /user/root/quickstart

You should see your data file listed in this directory if it has been loaded into HDFS successfully.
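
As an extra sanity check, you can print the first line of the file stored in HDFS to confirm that the header row really was removed before upload:

# Stream the CSV from HDFS and show only its first line
hdfs dfs -cat /user/root/quickstart/Crimes_-_2001_to_present.csv | head -n 1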

Submitting An Ingestion Job

Druid requires the user to submit an ingestion job that states the parameters for ingestion, including the location of the data to be ingested. These ingestion jobs take the form of a JSON file and should be customised for your particular data ingestion requirements. For the data we are using, save the following ingestion spec locally to your computer as crimes.json:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "quickstart/Crimes_-_2001_to_present.csv"
      }
    },
    "dataSchema": {
      "dataSource": "crime",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "month",
        "queryGranularity": "none",
        "intervals": [
          "2001-01-01/2018-10-01"
        ]
      },
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "csv",
          "timestampSpec": {
            "format": "MM/dd/yyyy hh:mm:ss a",
            "column": "Date"
          },
          "columns": [
            "ID",
            "Case Number",
            "Date",
            "Block",
            "IUCR",
            "Primary Type",
            "Description",
            "Location Description",
            "Arrest",
            "Domestic",
            "Beat",
            "District",
            "Ward",
            "Community Area",
            "FBI Code",
            "X Coordinate",
            "Y Coordinate",
            "Year",
            "Updated On",
            "Latitude",
            "Longitude",
            "Location"
          ],
          "dimensionsSpec": {
            "dimensions": [
              "ID",
              "Case Number",
              "Date",
              "Block",
              "IUCR",
              "Primary Type",
              "Description",
              "Location Description",
              "Arrest",
              "Domestic",
              "Beat",
              "District",
              "Ward",
              "Community Area",
              "FBI Code",
              "X Coordinate",
              "Y Coordinate",
              "Year",
              "Updated On",
              "Latitude",
              "Longitude",
              "Location"
            ]
          }
        }
      },
      "metricsSpec": [
        {
          "name": "count",
          "type": "count"
        }
      ]
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "mapreduce.job.classloader": "true"
      }
    }
  }
}
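
Before submitting the job, it is worth confirming that crimes.json is valid JSON; for example, if you have Python available on your local machine:

# Pretty-prints the spec if it is valid JSON, or reports a parse error otherwise
python -m json.tool crimes.json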

We can then POST this ingestion job from our local machine to our Druid Overlord instance using curl:

curl -X 'POST' -H 'Content-Type:application/json' -d @crimes.json http://<druid-overlord-ip-address>:8090/druid/indexer/v1/task 

You should receive a response containing the ID of the ingestion task. If you now visit the Druid Overlord UI, located at http://<druid-overlord-ip-address>:8090, you should see the ingestion task underway. You can open the logs for the task to follow its progress; depending on the size of your dataset, this may take some time. When the task is complete, the Druid Overlord UI will show "SUCCESS" as its statusCode.
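
You can also poll the task from the command line using the task ID returned by the submission request and the Overlord's task status endpoint:

# Replace <task-id> with the ID returned when the job was submitted
curl http://<druid-overlord-ip-address>:8090/druid/indexer/v1/task/<task-id>/status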
