Java, Big Data, Data Structures and Algorithms: June 2015

Friday, June 26, 2015

Basic Concepts in ElasticSearch

Index (Noun): An index is equivalent to a database in a relational database system. ElasticSearch stores data in one or more indices. It uses the Apache Lucene library to write data to and read data from an index.

Index (Verb): The process of storing data in an index is called indexing.

Type: A type is equivalent to a table in a relational database. A type contains zero or more documents. Unlike a table, the structure of a type is flexible, and you can add new fields whenever needed.

Field: A field is equivalent to a column in a relational database table. The data types of a type's fields can be defined through mappings.

Document: A document is equivalent to a row in a relational database table. A document consists of fields, and each field has a name and one or more values. Each document can have a different set of fields. You can think of a document as a JSON object stored in a way that allows it to be searched efficiently.
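To make the "JSON object stored so that it can be searched efficiently" idea concrete, here is a minimal sketch of a toy inverted index. The two documents, their field names, and the `build_inverted_index` helper are made up for illustration, and the real Lucene index structure is far more sophisticated than this.

```python
import json
from collections import defaultdict

# Two hypothetical documents of the same type. Note that they do not
# share the same set of fields, and "tags" holds more than one value.
docs = {
    1: {"name": "red shirt", "price": 20},
    2: {"name": "blue shirt", "tags": ["cotton", "summer"]},
}


def build_inverted_index(documents):
    """Map (field, token) -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for field, value in doc.items():
            # Treat single values and multi-valued fields uniformly.
            values = value if isinstance(value, list) else [value]
            for v in values:
                for token in str(v).lower().split():
                    index[(field, token)].add(doc_id)
    return index


index = build_inverted_index(docs)

print(json.dumps(docs[2]))                 # a document is just a JSON object
print(sorted(index[("name", "shirt")]))    # ids of documents whose name contains "shirt"
```

Looking up a term is a single dictionary access instead of a scan over every document, which is the basic reason an index makes search fast.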

Cluster: A set of ElasticSearch nodes can form a cluster to achieve scalability and availability.
More nodes can be added to scale the cluster horizontally. ElasticSearch distributes the load among the nodes. Data is replicated across nodes, so if a node fails, another node serves the data on its behalf, which ensures high availability.

Shard: ElasticSearch splits the data of an index into worker units called shards, so it is the shard that physically stores the documents. Each shard is a fully capable Lucene instance in itself. ElasticSearch transparently balances shards across the nodes of a cluster. A shard can be either a primary shard or a replica shard. A document in an index belongs to one and only one primary shard, and ElasticSearch knows how to find the shard that contains a particular document. Replica shards provide read scalability and failover.
Reads can be served by either a primary shard or a replica shard, but writes always go to the primary shard first and are then copied to its replicas.
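How does ElasticSearch know which shard holds a document? Its documentation describes the routing rule as shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. The sketch below illustrates that rule only. The function name `pick_primary_shard` is hypothetical, and `zlib.crc32` is a stand-in hash chosen because it is deterministic; it is not the hash function ElasticSearch uses.

```python
import zlib


def pick_primary_shard(doc_id: str, number_of_primary_shards: int) -> int:
    """Return the primary shard number for a document id.

    Assumption: crc32 stands in for the real hash function.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


# Distribute a few hypothetical document ids over 3 primary shards.
NUMBER_OF_PRIMARY_SHARDS = 3
shards = {n: [] for n in range(NUMBER_OF_PRIMARY_SHARDS)}
for doc_id in ["1234", "1235", "1236", "1237"]:
    shards[pick_primary_shard(doc_id, NUMBER_OF_PRIMARY_SHARDS)].append(doc_id)

for shard_number, doc_ids in shards.items():
    print(shard_number, doc_ids)
```

Because the shard number depends on the number of primary shards, the same document ID would map to a different shard if that number changed. This is why the number of primary shards is fixed when an index is created.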

How to Install Redis on CentOS

You can download the source code for the latest version of Redis and build it yourself.
As of this writing, the latest version of Redis is 3.0.2.

These are the steps to follow:

1. Download the source code.

$wget http://download.redis.io/releases/redis-3.0.2.tar.gz

2. Extract the archive.

$tar xzf redis-3.0.2.tar.gz

3. Build it.
$cd redis-3.0.2
$make

Note that Redis is written in C, so you need a C compiler on your machine in order to build the code.
If you don't have a C compiler, use the following command (as root) to install one:
$yum groupinstall 'Development Tools'

Sometimes, compilation may fail with this error:
"Newer version of jemalloc required"
To resolve this issue, run the make command with the following switch:
$make MALLOC=libc

Test the installation:
1. Go to the redis-3.0.2/src folder and start the Redis server.
$./redis-server
2. Open the Redis client in a separate terminal.
$./redis-cli
3. Now, try some commands at the redis-cli prompt:
127.0.0.1:6379> set "catalog:product:id" 1234
127.0.0.1:6379> get "catalog:product:id"
If you get back the same value that you set, the installation is working.
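The same check can be scripted. The sketch below talks to the server directly over a TCP socket using the Redis wire protocol (RESP), in which a command is sent as an array of bulk strings. The helper names `encode_command` and `send_command` are made up for this example. It assumes the server is listening on the default address 127.0.0.1:6379, and it reads only a single short reply.

```python
import socket


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode("utf-8")
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def send_command(*args: str, host: str = "127.0.0.1", port: int = 6379) -> bytes:
    """Send one command and return the raw reply bytes.

    Assumption: the reply is short enough to arrive in one recv() call,
    which is fine for a smoke test but not for a general-purpose client.
    """
    with socket.create_connection((host, port), timeout=2) as conn:
        conn.sendall(encode_command(*args))
        return conn.recv(4096)
```

With the server from step 1 running, `send_command("SET", "catalog:product:id", "1234")` should return `b"+OK\r\n"`, and `send_command("GET", "catalog:product:id")` should return `b"$4\r\n1234\r\n"`.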

Thursday, June 25, 2015

How to open a port on Linux Machine

1. Use this command to list the current iptables rules:

iptables -L -n

2. Insert a rule at position 2 of the INPUT chain to allow TCP traffic on port 5601. It must appear before any rule that rejects or drops traffic:

iptables -I INPUT 2 -p tcp --dport 5601 -j ACCEPT

3. Save the settings:
/etc/init.d/iptables save

4. Restart the service:

/etc/init.d/iptables restart
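To confirm that the port is reachable after the steps above, you can attempt a TCP connection from another machine. This is a minimal sketch. The function name `is_port_open` is hypothetical, and the host passed to it should be replaced with the address of the machine you configured.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: checking the local machine; use the server's address
    # when running this from a different host.
    print(is_port_open("127.0.0.1", 5601))
```

Note that a result of False does not necessarily mean the firewall is blocking the port. It is also what you get when no service is listening on that port.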

Tuesday, June 9, 2015

ElasticSearch: What is it?

ElasticSearch is a distributed search framework built on top of Lucene.
ElasticSearch removes the complexity of dealing with the Lucene API and exposes simple Java and REST interfaces that can be used to index and search documents.

It has several features that differentiate it from other related search technologies:

1. Document Store - It comes with an out-of-the-box NoSQL document store, so you don't need to hook up another data store in order to perform searches.

2. Distributed - Documents can be distributed over a number of nodes, which provides load balancing as well as failover. All of this is transparent to the user.

3. Simple cluster setup - Cluster setup doesn't require much configuration. Nodes with the same "cluster.name" in "elasticsearch.yml" join the same cluster automatically.

4. Fast - It is fast because every field is indexed and searchable.

5. Multiple API interfaces - You can use the Java API or the REST API for both indexing and searching. There are APIs available for other languages as well.

6. Scalable - It can be scaled to thousands of servers.

Note - ElasticSearch assumes that the application's primary database is a separate system, and that you index the data in ElasticSearch to make it searchable.
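As a rough illustration of the REST interface mentioned above, the sketch below builds the HTTP requests for indexing one document and for running a simple query-string search. The index name "catalog", the type name "product", the sample document, and both helper functions are made up for this example. It assumes a node on localhost using the default HTTP port 9200, and the index/type/id URL layout follows the concepts as described in this blog, so check the documentation for the version you are running.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, default HTTP port


def index_request(index, doc_type, doc_id, document):
    """Build a PUT request that stores `document` under index/type/id."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def search_request(index, query):
    """Build a GET request for a query-string search on an index."""
    url = f"{BASE_URL}/{index}/_search?q=" + urllib.parse.quote(query)
    return urllib.request.Request(url, method="GET")


put = index_request("catalog", "product", 1234, {"name": "red shirt"})
get = search_request("catalog", "name:shirt")
print(put.get_method(), put.full_url)
print(get.get_method(), get.full_url)
```

Nothing is sent in this sketch. With a node running, passing either request object to `urllib.request.urlopen` would execute it and return the JSON response.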