Wednesday, July 15, 2015

Integration of Logstash Forwarder, Logstash, ElasticSearch and Kibana


Logstash is a data pipeline that processes events and logs coming from a variety of systems. It is completely plugin based: with the help of input and output plugins, it can consume events from many sources and, after processing, ship them to many destinations. For our use case, we'll take Logstash Forwarder as the input and ElasticSearch as the output. We can attach Kibana to ElasticSearch to do the analytics and create dashboards.



Logstash Forwarder: Logstash Forwarder is a log shipper which ships logs to Logstash. It was earlier known as Lumberjack and is written in Go.

Here is how to set up the integration described above:

1. Name the host: IP-based certificates tend to cause issues with logstash-forwarder, so give the host a name. For this, make a host entry in /etc/hosts:

127.0.0.1 maverick.logstash.com

2. Create certificates: Logstash Forwarder requires certificates in order to communicate with Logstash. The paths of these certificates need to go into both the logstash-forwarder configuration and the logstash configuration.

Use following command to generate key pairs (You need to have OpenSSL):

openssl req -x509 -nodes -newkey rsa:2048 -keyout /etc/pki/tls/private/logstash-forwarder/logstash-forwarder.key -out /etc/pki/tls/certs/logstash-forwarder/logstash-forwarder.crt -days 365 

Enter the name of the organisation, unit etc. as prompted on the command line. Make sure that you use the same host name during certificate creation as in step 1, since it becomes the certificate's Common Name. The above command produces two files - a private key and a self-signed certificate (the public half). You need to copy these files to the logstash host if logstash-forwarder and logstash are not on the same machine.

Now add the entry in CA:

openssl x509 -in /etc/pki/tls/certs/logstash-forwarder/logstash-forwarder.crt -text >> /etc/pki/tls/certs/ca-bundle.crt
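
To double-check that the certificate carries the host name from step 1 as its Common Name, you can inspect it (a quick sanity check using the same paths as above):

openssl x509 -in /etc/pki/tls/certs/logstash-forwarder/logstash-forwarder.crt -noout -subject -dates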

3. Install Logstash: 
   
   Download logstash archive:
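
   (The exact download link may have changed; at the time of writing, Elastic hosted the 1.5.2 tarball at download.elastic.co, so something like the following should work.)

   wget https://download.elastic.co/logstash/logstash/logstash-1.5.2.tar.gz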


   Unzip the archive:

   tar -xzf logstash-1.5.2.tar.gz

   Add a configuration file for logstash named logstash.conf in logstash-1.5.2/bin. A sample logstash.conf looks like this:

   input {
     lumberjack {
       port => 5043
       type => "logs"
       ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder/logstash-forwarder.crt"
       ssl_key => "/etc/pki/tls/private/logstash-forwarder/logstash-forwarder.key"
     }
   }

   filter {
     grok {
       match => [ "message", "%{COMBINEDAPACHELOG}" ]
     }
   }

   output {
     elasticsearch {
       host => "10.1.40.222"
       protocol => "http"
     }
     stdout { codec => rubydebug }
   }

The input for this is lumberjack, which is logstash-forwarder, and the output is elasticsearch. The filter plugin does the processing of log events, and here it uses another plugin called "grok". A grok filter accepts a pattern and, based on that pattern, breaks a log event into meaningful fields. For our example, we'll use the Apache log pattern %{COMBINEDAPACHELOG}, which is built into logstash.
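
For instance, a made-up access-log line like:

   127.0.0.1 - frank [15/Jul/2015:10:12:08 +0530] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/start.html" "Mozilla/5.0"

would be broken by %{COMBINEDAPACHELOG} into fields such as clientip, verb, request, response and bytes, each of which becomes individually searchable in elasticsearch.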

Note that 10.1.40.222 is the IP of the host running elasticsearch.

  Logstash can be started using the following command:

  ./logstash  -f logstash.conf
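
  You can also ask logstash to just validate the configuration without starting the pipeline (the --configtest flag, available in the 1.x series):

  ./logstash -f logstash.conf --configtest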

4. Install Logstash Forwarder: The best way to install it is to build it from source code. For that, we need Go first.

Download Go: 
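
(Again, the link may have moved; Go 1.4.2 for 64-bit Linux was distributed as go1.4.2.linux-amd64.tar.gz from Google's servers, so something like this should work.)

wget https://storage.googleapis.com/golang/go1.4.2.linux-amd64.tar.gz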


Extract it:

tar -C /usr/local -xzf go1.4.2.linux-amd64.tar.gz

Add Go to the PATH by making an entry in /etc/profile:

export PATH=$PATH:/usr/local/go/bin

Download logstash forwarder code:

git clone git://github.com/elasticsearch/logstash-forwarder.git

Compile the source:

cd logstash-forwarder
go build -o logstash-forwarder

Create a logstash-forwarder config file like this (note that "ssl_ca" must point at the certificate generated in step 2; in this example it has been copied to /etc/ssl/certs/new-forwarder.crt):

{
  "network": {
    "servers": [ "maverick.logstash.com:5043" ],
    "ssl_ca": "/etc/ssl/certs/new-forwarder.crt",
    "ssl_key": "/etc/ssl/private/new-forwarder.key"
  },
  "files": [
    {
      "paths": [ "/tmp/access_log" ],
      "fields": { "type": "apache" }
    }
  ]
}

Run logstash-forwarder like this:

./logstash-forwarder -config logstash-forwarder.conf
  
5. Run elasticsearch:

  ./elasticsearch

6. Once logstash-forwarder, logstash and elasticsearch are running, put Apache logs at the location defined in the logstash-forwarder config ("/tmp/access_log" in the example above). You'll find that every line of the log is put into elasticsearch as a separate document.
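
A quick way to verify the whole chain end to end (the log line below is made up; 9200 is elasticsearch's default HTTP port and 10.1.40.222 the host used above):

echo '127.0.0.1 - - [15/Jul/2015:10:12:08 +0530] "GET /index.html HTTP/1.1" 200 2326 "-" "curl/7.29.0"' >> /tmp/access_log

curl 'http://10.1.40.222:9200/logstash-*/_search?q=response:200&pretty'

If grok parsed the line, the returned hit will contain fields like clientip, verb and response.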

Wednesday, July 1, 2015

Installing Mongo DB



Here are the steps to install Mongo DB version 3.0.4 on CentOS:


1. Download Mongo DB archive:


    wget http://downloads.mongodb.org/linux/mongodb-linux-x86_64-3.0.4.tgz


2. Extract the archive:  


    tar -xzf mongodb-linux-x86_64-3.0.4.tgz


3. Mongo DB uses /data/db as the default directory for data storage, so create this directory with proper permissions:


   mkdir -p /data/db


4. Launch Mongo DB:


    cd mongodb-linux-x86_64-3.0.4/bin


    ./mongod


If everything is fine, MongoDB will start listening on its default port, 27017.
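
To verify, connect with the bundled mongo shell from a second terminal and do a trivial round trip (the collection and field names here are just for illustration):

    ./mongo
    > db.test.insert({ name: "hello" })
    > db.test.find()

If find() returns the document you just inserted, the server is working.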

Friday, June 26, 2015

Basic Concepts in ElasticSearch


Index (Noun): An index is the equivalent of a database in a relational database. ElasticSearch stores data in one or more indices and uses the Apache Lucene library to write and read data from an index.

Index (Verb): The process of storing data in an index is called indexing.

Type: A type is the equivalent of a table in a relational database. A type contains zero or more documents. Unlike a table, the structure of a type is flexible: you can add new fields whenever needed.

Field: A field is the equivalent of a column in a relational database table. The data types of a type's fields can be defined through mappings.

Document: A document is the equivalent of a row in a relational database table. A document consists of fields, and each field has a name and one or more values. Each document can have a different set of fields. You can think of a document as a JSON object stored in a way that it can be searched efficiently.
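
To make the analogy concrete, this is how a single document would be indexed over the REST API (the index, type and field names below are made up; 9200 is the default HTTP port):

curl -XPUT 'http://localhost:9200/catalog/product/1' -d '{ "name": "laptop", "price": 55000 }'

Here catalog is the index (database), product is the type (table), 1 is the document id and name/price are the fields (columns).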

Cluster: A set of ElasticSearch nodes can form a cluster to achieve two things: scalability and availability. More nodes can be added to scale horizontally, and ElasticSearch distributes the load among them. Data is replicated across nodes, so if a node fails, another node takes charge of serving its data; that is how high availability is ensured.

Shard: ElasticSearch distributes the data of an index into worker units called shards. So, essentially, it is the shard which stores the documents physically. Each shard is a fully capable Lucene instance in itself. Balancing of shards across the nodes of a cluster is done transparently by ElasticSearch. A shard can be either a primary shard or a replica shard. A document in an index belongs to one and only one primary shard, and ElasticSearch knows how to find the shard which contains a particular document. Replica shards are for scalability and fail-over: reads can happen on either a primary or a replica shard, but writes happen only on the primary shard.
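
The number of primary and replica shards is fixed per index at creation time. As an illustration, an index could be created with explicit counts like this (5 and 1 happen to be the defaults):

curl -XPUT 'http://localhost:9200/catalog' -d '{ "settings": { "number_of_shards": 5, "number_of_replicas": 1 } }'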

How to install Redis on CentOS


You can download the source code for the latest version of Redis and build it.
As of this writing, the latest version of Redis is 3.0.2.

These are the steps to follow:

1. Download the source code.

   $wget http://download.redis.io/releases/redis-3.0.2.tar.gz

2. Extract the archive.

    $tar xzf redis-3.0.2.tar.gz

3. Build it.

    $cd redis-3.0.2
    $make

Note that Redis is written in C, so you need a C compiler on your machine in order to build the code.
If you don't have a C compiler, use the following command to install it:
$yum groupinstall 'Development Tools'

Sometimes, compilation may fail with a jemalloc-related error ("newer version of jemalloc required").
To resolve this issue, use the make command with the following switch:
$make MALLOC=libc

Test the installation:
1. Go to the redis-3.0.2/src folder and start the Redis server.
   $./redis-server
2. Open the Redis client in a separate terminal.
   $./redis-cli
3. Now, try some commands inside redis-cli:
   set "catalog:product:id" 1234
   get "catalog:product:id"
   If get returns the same value you set, everything is working fine.

Thursday, June 25, 2015

How to open a port on a Linux machine


1. Use this command to list the current rules:

     iptables -L -n

2. Insert a rule into the INPUT chain (at position 2 here, so that it sits above any blanket REJECT rule) to allow port 5601:

    iptables -I INPUT 2 -p tcp --dport 5601 -j ACCEPT

3. Save the settings:
    /etc/init.d/iptables save

4. Restart the service:

   /etc/init.d/iptables restart
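
To confirm the rule is in place, list the rules again and look for the port (5601 is Kibana's default port, which is why it appears in this example):

   iptables -L -n | grep 5601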

Tuesday, June 9, 2015

ElasticSearch: What is it?



ElasticSearch is a distributed search framework built on top of Lucene.
It removes the complexity of dealing with the Lucene API and exposes simple Java and REST interfaces which can be used to index and search documents.

It has certain features which differentiate it from other related search technologies:

1. Document Store - It comes with an out-of-the-box NoSQL document store, so you don't need to hook up another data store in order to perform searches.

2. Distributed - Documents are distributed over a number of nodes, which provides load-balancing as well as fail-over. All of this is transparent to the user.

3. Simple cluster setup - Cluster setup doesn't require much configuration. Nodes with the same "cluster.name" in "elasticsearch.yml" join the same cluster automatically (see the snippet at the end of this post).

4. Fast - Searches are fast because every field is indexed by default and hence searchable.

5. Multiple API interfaces - You can use the Java API as well as the REST API for indexing and search. Client libraries are available for other languages as well.

6. Scalable - It can be scaled to thousands of servers.

Note - ElasticSearch assumes that the primary database for your application lives elsewhere; you index the data in ElasticSearch to make it searchable.
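
As an example of point 3 above, this single line in each node's elasticsearch.yml is enough for the nodes to discover each other and form a cluster, assuming they can reach each other on the network (the name itself is arbitrary):

cluster.name: my-search-cluster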