Data Ingestion With Druid and Hadoop
Introduction
Welcome to this guide on how to successfully ingest data into your Druid and Hadoop bundle. We will walk you through setting up your Anssr Druid Hadoop model, placing your data in the correct location within HDFS, and submitting a data ingestion job to Druid. We will assume you have already spun up an Anssr Druid Hadoop bundle; if you have not, refer to our guide Deploying An Anssr Druid Hadoop Bundle.
Data Used For This Guide
For the purposes of this guide, we will be using crime data from the city of Chicago covering 2001 to 2018. You can find this data on the US Government's Data Catalog here. Note that if you load this data into Druid as-is, Druid will throw errors because the first line of the CSV file contains column headers, one of which is "DATE". Therefore, before using this data, remove the first row of the file, leaving just the raw data. Column headers will be supplied in the ingestion task we submit to Druid later in the guide.
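One straightforward way to strip the header row, once you have downloaded the CSV file (as shown later in this guide), is with sed, which edits the file in place:
sed -i '1d' Crimes_-_2001_to_present.csv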
Placing Your Data Into HDFS
Once your Anssr Druid Hadoop model has successfully been deployed, Druid will start to use HDFS as its data store. Therefore, we need to place our data within HDFS so that Druid can access it.
Firstly, we need to find which machine the Hadoop Resource Manager is located on. To do this, we can either use the JAAS GUI or run juju status from the command line.
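If you use juju status, you can pass a filter to narrow the output to the Resource Manager and find its machine number more quickly. For example, assuming the application is named resourcemanager in your bundle:
juju status resourcemanager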
Once you have found the machine number, you can ssh into this server using the following command:
juju ssh <server-number>
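Alternatively, you can ssh to a unit by name rather than by machine number; for example, assuming the Resource Manager application is named resourcemanager and you want its first unit:
juju ssh resourcemanager/0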
On this server, we can gain root privileges by running sudo -i. Once this is done, we can access the HDFS filesystem using the hdfs dfs command. The Druid Config charm will create a directory at /user/root/quickstart, so it is useful to place your raw data in this directory. You can use curl to download the Chicago crime data from above onto the Resource Manager server, and then copy it into HDFS:
curl https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD > Crimes_-_2001_to_present.csv
hdfs dfs -put Crimes_-_2001_to_present.csv /user/root/quickstart
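If the hdfs dfs -put command fails because the quickstart directory does not exist (the Druid Config charm should normally have created it), you can create it yourself before copying the file:
hdfs dfs -mkdir -p /user/root/quickstart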
You can verify that the file has been placed correctly by listing the contents of the /user/root/quickstart directory on HDFS:
hdfs dfs -ls /user/root/quickstart
You should see your data listed in this directory if you have successfully loaded it into HDFS.
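If you would also like to spot-check the contents, you can stream the first few lines of the file back out of HDFS; for example:
hdfs dfs -cat /user/root/quickstart/Crimes_-_2001_to_present.csv | head -n 3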
Submitting An Ingestion Job
Druid requires the user to submit an ingestion job, stating parameters for ingestion, including the location of the data to be ingested. These ingestion jobs take the form of a JSON file and should be customised for your particular data ingestion requirements. For the data we are using, save the following ingestion spec locally on your computer as crimes.json:
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "quickstart/Crimes_-_2001_to_present.csv"
      }
    },
    "dataSchema": {
      "dataSource": "crime",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "month",
        "queryGranularity": "none",
        "intervals": [
          "2001-01-01/2018-10-01"
        ]
      },
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "csv",
          "timestampSpec": {
            "format": "MM/dd/yyyy hh:mm:ss a",
            "column": "Date"
          },
          "columns": [
            "ID",
            "Case Number",
            "Date",
            "Block",
            "IUCR",
            "Primary Type",
            "Description",
            "Location Description",
            "Arrest",
            "Domestic",
            "Beat",
            "District",
            "Ward",
            "Community Area",
            "FBI Code",
            "X Coordinate",
            "Y Coordinate",
            "Year",
            "Updated On",
            "Latitude",
            "Longitude",
            "Location"
          ],
          "dimensionsSpec": {
            "dimensions": [
              "ID",
              "Case Number",
              "Date",
              "Block",
              "IUCR",
              "Primary Type",
              "Description",
              "Location Description",
              "Arrest",
              "Domestic",
              "Beat",
              "District",
              "Ward",
              "Community Area",
              "FBI Code",
              "X Coordinate",
              "Y Coordinate",
              "Year",
              "Updated On",
              "Latitude",
              "Longitude",
              "Location"
            ]
          }
        }
      },
      "metricsSpec": [
        {
          "name": "count",
          "type": "count"
        }
      ]
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "mapreduce.job.classloader": "true"
      }
    }
  }
}
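Before submitting, it can be worth confirming the file is valid JSON so that a stray comma or quote does not cause the submission to be rejected. One quick check, assuming Python is installed on your local machine, is:
python -m json.tool crimes.json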
We can curl this ingestion job from our local machine to our Druid Overlord instance:
curl -X 'POST' -H 'Content-Type:application/json' -d @crimes.json http://<druid-overlord-ip-address>:8090/druid/indexer/v1/task
You should receive a response containing the job ID for this ingestion task. If you now visit the Druid Overlord UI, located at http://<druid-overlord-ip-address>:8090, you should see the ingestion task underway. You can click through to the task's logs to follow its progress. Depending on the size of your dataset, this may take some time. When the task is complete, the Druid Overlord UI will show "SUCCESS" under the task's statusCode.
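You can also poll the task status from the command line using the Overlord's task API, where <task-id> is the ID returned in the response to your submission request:
curl http://<druid-overlord-ip-address>:8090/druid/indexer/v1/task/<task-id>/status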