
Sunday, July 17, 2016

angelhack kuala lumpur 2016

Coding is something I am passionate about, and if you have been following my blog, you will know most of the articles are related to big data technologies. Two weeks ago as of this write-up, a professor from MMU Malaysia asked me if I would like to join his team for a hackathon event held at a mall in the city centre, Times Square.

I took a look at the event, which can be found here and here. To my astonishment, it is a paid coding event priced in United States dollars even though we are in Malaysia. That this is something I have never done before, that is, going to a physical location and coding a project to compete against other teams, was one of my decisive factors. And of course, I wanted to find out who is in the local development community, get to know more developers, as well as have some fun!

There are mainly three challenges this year, which are:
* big data analytics
* o2o commerce
* smart living

and a grand prize, which is an exclusive invite to a hackcelerator program and a chance to fly to Silicon Valley to compete further! Of course, each category comes with a prize pool!

Since I have a practical background in big data technology and startups, the professor thought that by inviting me he could form a team of five to compete in the big data analytics category. With that, I decided to give it a try and see how it goes. The event was held on 4 and 5 June 2016, on a weekend, which fits nicely with a working professional's schedule.

For my part, I brought in big data technology, and this time I selected elasticsearch as the nosql store, which matches the challenge category we were aiming for. Unfortunately, since much of my time has been devoted to mid-tier and backend development, my user interface development skills have deteriorated tremendously. (Note to self: start coding GUIs again!) Fortunately, there is a good companion that fills that gap and comes with elasticsearch, known as kibana, which covers what I could not bring to the table. Combining elasticsearch and kibana, we were not only able to ingest the 65M iproperty dataset into the machine, we were also able to query it and present it nicely in colourful graph form.

As for the rest of the team, they brought in skill sets like deep learning and predictive analytics, which answer the 'predictive' and 'prescriptive' parts. The idea of our product is that a user uploads a picture of a property through a web interface (via a phone camera), the system analyses that image, and the result is computed against the iproperty dataset stored in elasticsearch and displayed on another page.

Below are some interesting pictures taken during the event.

the stage

panorama view

and the developers

As you can see from the pictures above, there were plenty of coders, each skilled in their own area! There were an estimated 370 participants, and that's including female coders too! Overall, I think the organizers did a really good job with this two-day event. The participants got power, cool air, water, food, network devices and cables, junk food, t-shirts, cups and writing notebooks, and the staff were very helpful.

One of them actually came to me asking what I would like, and handed me a paper cup of coffee as I requested. I could not express more gratitude, since I got there around 8am and that's early for a weekend. ;-) But I would like to point out some definite areas to improve for next year.

* warm air on day one; it was not continuous, but you could feel it.
* some of the schedule (the judging and presentations) written on paper was way off from the actual flow of events.
* power and internet connection dropped from time to time over the two days.
* less junk food, perhaps replaced with fruits.
* the table arrangement was too small for computer devices as well as personal comfort.

But all in all, good job completing this two-day event in a safe and sound manner. If you know me, and the motto of this website, striving for contribution and mutual benefit, I found that there are still some, if not many, developers who are very selfish in terms of code sharing and/or code learning. For instance, all my code and infrastructure are available online and can be found here. But in return, there were many excuses not to share among ourselves, something that really goes against my principles. I know that an event such as this is about competing against other teams, but I think what's more valuable than all these monetary rewards is the innovation and cultivating a passion for coding. That is something I still did not see this year. I don't believe we can thrive in a closed environment, and innovation definitely does not grow under such a shallow attitude.

hackathon ending in 5 minutes

pitching schedule

the team during pitching session.

Many coders actually slept in the mall, but for me, home sweet home. I had a self-reflection session with the team, and the professor gave valuable insights that night on slack and over the following days, something I had not experienced since leaving academia. The coding session stopped at 12pm on day 2, and then participants were free for lunch while the organizers and judges prepared themselves for the participants' pitching sessions. There were two rounds of pitching, and each team got about 2-3 minutes to explain their product and 1-2 minutes for questions and answers.

It's sad, but a reality, that during the shortlisted teams' pitches on stage, many judges kept using words such as intellectual property, monetization, how you sell your product, or how you make money out of your product. Honestly, these types of questions should be the responsibility of business professionals. An event such as this should focus on the coding spirit and actually value the product that meets the set criteria. Yes, no doubt money is the topic of a company, but that should be a trivial issue; what matters more is that companies should be the main catalyst in driving the country towards better quality, not just projecting a money-driven society. Remember, some participants are as young as 15.

team effort

solo view

By the time the big data category announcement was made, I was dead tired. It's interesting to learn from other teams on user interface design, as well as how an idea was transformed into a mockup prototype in just two days or less. To my surprise, we actually got second prize for the challenge we competed in! That's worth 3000MYR, and what's more valuable is a chance to be invited to the Big App Challenge 3.0 Semi Finals. Hehe, whilst I fully acknowledge this reward rightfully belongs to the team, a solo picture taken to acknowledge my contribution is not too much to ask for.

Till then, keep coding, and if you are interested in coding, please feel free to contact me about any possibility of working together.




Friday, July 3, 2015

how big data can help legal firms?

Today, we are going to do something a little different from our usual learning journey. By that, I mean not purely information technology, but something related. Let me explain further. I was reading the Malaysia Personal Data Protection Act 2010, or PDPA 2010, on this blog. Legal is not my profession, but reading this article as an information technology professional gave me several ideas.

Reading this article, no offence, really is a daunting activity. It is a long and dull read. :-) Nonetheless, every wording is equally important in defining what the act means and what scope it encompasses. I think an information retrieval application like elasticsearch would be a good match for this: index all the words in the articles, then search quickly and show which act, section and article references the term. It is even better with scoring, as the most relevant documents are shown first, which is what a lawyer would probably want in order to quickly find the relevant document for further reading. I'm sure there are books with a thousand pages, and remembering every single line of the acts is almost impossible or impractical. Information technology can fill this gap for them.
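As a concrete sketch of this idea, assuming a 0.90-era cluster like the ones discussed elsewhere on this blog, indexing one section of an act and searching it by legal terminology could look like the snippet below. The cluster name, index, type and field names are all made up for illustration.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class LegalSearch {
    public static void main(String[] args) {
        TransportClient client = new TransportClient(ImmutableSettings.settingsBuilder()
                .put("cluster.name", "legal-cluster") // illustrative cluster name
                .build())
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // index one section of an act as a searchable document
        client.prepareIndex("acts", "section", "pdpa-2010-s6")
                .setSource("{\"act\":\"PDPA 2010\",\"section\":\"6\"," +
                        "\"text\":\"A data user shall not process personal data unless ...\"}")
                .execute().actionGet();

        // full text search; the most relevant sections come back first, with a score
        SearchResponse response = client.prepareSearch("acts")
                .setQuery(QueryBuilders.matchQuery("text", "process personal data"))
                .execute().actionGet();
        for (SearchHit hit : response.getHits()) {
            System.out.println(hit.getScore() + " " + hit.getSourceAsString());
        }
        client.close();
    }
}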

For law students, this is especially useful, as it will speed up the way they learn law. Nobody wants to sit in a library for hours and then spend twelve hours a day reading 1000 pages. I think what drives people is active learning, not passive reading. So I guess with elasticsearch, they can quickly search legal terminology and the results will show them the book that best serves their interest.

Each court case, transcript, or any text data can be digitized into query-able data. With information retrieval tools like elasticsearch, that data can then be turned into information. I believe a high court case can take months or even a year to complete; quickly digitizing this data so it can be referenced later, be it later in the same court case or in the next one, would put a law firm at the next stage.

Of course, this is just my opinion, and maybe expressed only from the information technology point of view (as, rightfully, I.T. is my profession). Please feel free to comment and improve if you find anything. Thank you.

Friday, May 8, 2015

Elasticsearch no node exception in a tomcat web container

If you ever get a stack trace such as the one below in your web container log file and wonder how to solve it, then read on. But first, a little background: an elasticsearch 0.90 cluster, with a client running in a tomcat web container using the elasticsearch java transport client. Both server and client run the same elasticsearch version and the same java version.
16.Feb 6:21:30,830 ERROR WebAppTransportClient [put]: error
org.elasticsearch.client.transport.NoNodeAvailableException: No node available
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:212)
at org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106)
at org.elasticsearch.client.support.AbstractClient.index(AbstractClient.java:84)
at org.elasticsearch.client.transport.TransportClient.index(TransportClient.java:316)
at org.elasticsearch.action.index.IndexRequestBuilder.doExecute(IndexRequestBuilder.java:324)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
at com.example.elasticsearch.WebAppTransportClient.put(WebAppTransportClient.java:258)
at com.example.elasticsearch.WebAppTransportClient.put(WebAppTransportClient.java:307)
at com.example.threadpool.TaskThread.run(TaskThread.java:38)
at java.lang.Thread.run(Thread.java:662)

This exception disappears once the web container is restarted, but restarting the webapp that often is not a good solution in production. I did some research online and gathered the following information:

* The default number of channels in each of these class are configured with the configuration prefix of transport.connections_per_node.
https://www.found.no/foundation/elasticsearch-networking/

* If you see NoNodeAvailableException you may have hit a connect timeout of the client. Connect timeout is 30 secs IIRC.
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/VyNpCs17aTA/CcXkYvVMYWAJ

* You can set org.elasticsearch.client.transport to TRACE level in your logging configuration (on the client side) to see the failures it has (to connect for example). For more information, you can turn on logging on org.elasticsearch.client.transport.
https://groups.google.com/forum/#!topic/elasticsearch/Mt2x4d5BCGI

* This means that you started to get disconnections between the client (transport) and the server. It will try and reconnect automatically, and possibly manages to do it. For more information, you can turn on logging on org.elasticsearch.client.transport.
* Can you try and increase the timeout and see how it goes? Set client.transport.ping_timeout in the settings you pass to the TransportClient to 10s for example.
* We had the same problem. reason: The application server uses a older version of log4j than ES needed.
http://elasticsearch-users.115913.n3.nabble.com/No-node-available-Exception-td3920119.html

* The correct method is to add the known host addresses with addTransportAddresses() and afterwards check the connectedNodes() method. If it returns empty list, no nodes could be found.
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/ceH3UIy14jM/XJSFKd8kAXEJ

* the most common case for NoNodeAvailable is the regular pinging that the transport client does fails to do it, so no nodes end up as the list of nodes that the transport client uses. If you will set client.transport (or org.elasticsearch.client.transport if running embedded) to TRACE, you will see the pinging effort and if it failed or not (and the reason for the failures). This might get us further into trying to understand why it happens.
* .put("client.transport.ping_timeout", pingTimeout)
* .put("client.transport.nodes_sampler_interval", pingSamplerInterval).build();
https://groups.google.com/forum/#!msg/elasticsearch/9aSkB0AVrHU/_4kDkjAFKuYJ

* this has nothing to do with migration errors. Your JVM performs a very long GC of 9 seconds which exceeds the default ping timeout of 5 seconds, so ES dropped the connection ,assuming your JVM is just too busy. Try again if you can reproduce it. If yes, increase the timeout to something like 10 seconds, or consider to update your Java version.
http://elasticsearch-users.115913.n3.nabble.com/Migration-errors-0-20-1-to-0-90-td4035165.html

* During long GC the JVM is somehow suspended. So your client can not see it anymore.
http://grokbase.com/t/gg/elasticsearch/136fw0hppp/transport-client-ping-timeout-no-node-available-exception

* You wrote that you have a 0.90.9 cluster but you added 0.90.0 jars to the client. Is that correct?
* Please check:
* if your cluster nodes and client node is using exactly the same JVM
* if your cluster and client use exactly the same ES version
* if your cluster and client use the same cluster name
* reasons outside ES: IP blocking, network reachability, network interfaces, IPv4/IPv6 etc.
* Then you should be able to connect with TransportClient.

https://groups.google.com/forum/#!msg/elasticsearch/fYmKjGywe8o/z9Ci5L5WjUAJ

So I have tried all the options mentioned, and the problem was solved by adding sniff to the transport client settings. For more information, read here.
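For reference, a minimal sketch of how that final transport client configuration might look with the 0.90 Java API; the cluster name and host are illustrative, and the ping settings are the ones suggested in the threads above:

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class ClientFactory {
    public static TransportClient create() {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "my-cluster") // must match the server side
                .put("client.transport.sniff", true) // the setting that fixed it for me
                .put("client.transport.ping_timeout", "10s") // raised from the 5s default
                .put("client.transport.nodes_sampler_interval", "10s")
                .build();
        TransportClient client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("es-node1", 9300));
        // sanity check suggested above: an empty list means no nodes could be found
        if (client.connectedNodes().isEmpty()) {
            throw new IllegalStateException("no elasticsearch nodes reachable");
        }
        return client;
    }
}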

I hope this will solve your problem too.

Sunday, March 15, 2015

Investigating the elasticsearch already flushing exception

Today, we will take a look at another elasticsearch issue. Below is an example of the message output in the log file of one of my nodes.
[2015-03-02 04:51:30,450][DEBUG][action.admin.indices.flush] [node02] [index_A][1], node[DEF_-ABC], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.flush.FlushRequest@23465ba5]
org.elasticsearch.index.engine.FlushNotAllowedEngineException: [index_A][1] already flushing...
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:825)
at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:564)
at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:114)
at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:49)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$2.run(TransportBroadcastOperationAction.java:225)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

We will investigate the code path that leads to the exception above and decide whether this is a problem. Below is a snippet of the "entry" elasticsearch code.
public void run() {
    try {
        // execute the shard-level operation (here: the flush) and report the result
        onOperation(shard, shardIndex, shardOperation(shardRequest));
    } catch (Throwable e) {
        // any failure, including FlushNotAllowedEngineException, is reported through here
        onOperation(shard, shardIt, shardIndex, e);
    }
}

As we read above, this is an asynchronous broadcast action executed while performing an action. The code execution path tells us that this shard is a local shard and encountered an exception during flushing.
@Override
protected ShardFlushResponse shardOperation(ShardFlushRequest request) throws ElasticSearchException {
    IndexShard indexShard = indicesService.indexServiceSafe(request.index()).shardSafe(request.shardId());
    indexShard.flush(new Engine.Flush().type(request.full() ? Engine.Flush.Type.NEW_WRITER : Engine.Flush.Type.COMMIT_TRANSLOG).force(request.force()));
    return new ShardFlushResponse(request.index(), request.shardId());
}

Tracing up the stack trace, we notice it is the internal index shard class flushing the shard.
@Override
public void flush(Engine.Flush flush) throws ElasticSearchException {
    // we allows flush while recovering, since we allow for operations to happen
    // while recovering, and we want to keep the translog at bay (up to deletes, which
    // we don't gc).
    verifyStartedOrRecovering();
    if (logger.isTraceEnabled()) {
        logger.trace("flush with {}", flush);
    }
    long time = System.nanoTime();
    engine.flush(flush);
    flushMetric.inc(System.nanoTime() - time);
}

As can be read above, we are at the end of the investigation journey here. It turns out that when this shard's flushing count is greater than one, the flushing count is decreased again and the exception is thrown, telling us that this shard is currently being flushed.
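From memory, the guard in RobinEngine.flush() that produces this behaviour looks roughly like the paraphrased snippet below; this is my own reconstruction, not the verbatim 0.90 source:

// 'flushing' is a counter on the engine tracking in-progress flushes
if (flushing.incrementAndGet() > 1) {
    // another flush is already running; undo our increment and bail out
    flushing.decrementAndGet();
    throw new FlushNotAllowedEngineException(shardId, "already flushing...");
}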

As to why this happens is puzzling, but I have never seen this exception since, and from the investigation above it does not look harmful, since the shard in question is simply already being flushed.

If you think this analysis is not correct or you have better insight, please leave your comment below. Thank you.

Saturday, March 14, 2015

Investigating elasticsearch indices memory marking shards active or inactive

If you have enabled logging for the indices memory controller (indices.memory: DEBUG in logging.yml) in elasticsearch 0.90 to see how the shards are behaving, you will notice messages such as the ones below appearing quite often in the log file.
[2014-12-11 15:22:27,562][DEBUG][indices.memory           ] [es01] recalculating shard indexing buffer (reason=active/inactive[true] created/deleted[false]), total is [4.2gb] with [11] active shards, each shard set to indexing=[391.8mb], translog=[64kb]
[2014-12-11 15:23:57,562][DEBUG][indices.memory ] [es01] marking shard [index_A][7] as inactive (inactive_time[30m]) indexing wise, setting size to [500kb]
[2014-12-11 15:23:57,562][DEBUG][indices.memory ] [es01] marking shard [index_B][0] as inactive (inactive_time[30m]) indexing wise, setting size to [500kb]
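For reference, the logging.yml fragment that produces the output above is a single entry under the logger section (a sketch of the 0.90-era logging.yml format):

logger:
  # debug output from the indices memory controller
  indices.memory: DEBUG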

Tracing into the code base, class IndexingMemoryController, we notice that a runnable instance of ShardsIndicesStatusChecker is created and scheduled periodically with a default interval of 30 seconds. Reading into this class, we see that the algorithm loops through the indices service and, for each index service, checks the status of each index shard. Based on the log output above, we can see that control goes into the path below, where the shard is marked as inactive for indexing.
if (!status.inactiveIndexing) {
    // mark it as inactive only if enough time has passed and there are no ongoing merges going on...
    if ((time - status.time) > inactiveTime.millis() && indexShard.mergeStats().getCurrent() == 0) {
        // inactive for this amount of time, mark it
        activeToInactiveIndexingShards.add(indexShard);
        status.inactiveIndexing = true;
        activeInactiveStatusChanges = true;
        logger.debug("marking shard [{}][{}] as inactive (inactive_time[{}]) indexing wise, setting size to [{}]", indexShard.shardId().index().name(), indexShard.shardId().id(), inactiveTime, Engine.INACTIVE_SHARD_INDEXING_BUFFER);
    }
}

Going further down the code, because the variable activeInactiveStatusChanges changed, the if statement below evaluates to true:
if (shardsCreatedOrDeleted || activeInactiveStatusChanges) {
    calcAndSetShardBuffers("active/inactive[" + activeInactiveStatusChanges + "] created/deleted[" + shardsCreatedOrDeleted + "]");
}

Tracing into the method calcAndSetShardBuffers(), we see that the shard buffer size is calculated and kept within a range. The default minimum shard buffer is 4MB and the default maximum shard buffer is 512MB. A similar calculation happens for the translog buffer size: the default minimum translog buffer size is 2KB and the default maximum is 64KB. Then the log message such as the one above is written using the code below.
logger.debug("recalculating shard indexing buffer (reason={}), total is [{}] with [{}] active shards, each shard set to indexing=[{}], translog=[{}]", reason, indexingBuffer, shardsCount, shardIndexingBufferSize, shardTranslogBufferSize);
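To make the calculation concrete, below is a simplified sketch of the clamping described above. This is my own paraphrase rather than the actual calcAndSetShardBuffers() source; the constants are the defaults quoted in the previous paragraph.

public class ShardBufferMath {
    static final long MB = 1024 * 1024;
    static final long MIN_SHARD_BUFFER = 4 * MB; // default minimum shard indexing buffer
    static final long MAX_SHARD_BUFFER = 512 * MB; // default maximum shard indexing buffer

    // split the total indexing buffer evenly across active shards, then clamp to the range
    static long perShardBuffer(long totalIndexingBuffer, int activeShards) {
        long share = totalIndexingBuffer / activeShards;
        return Math.max(MIN_SHARD_BUFFER, Math.min(MAX_SHARD_BUFFER, share));
    }

    public static void main(String[] args) {
        // roughly reproduces the log line above: a 4.2gb total across 11 active shards
        // works out to about 391mb each (the 4.2gb in the log is itself rounded)
        long total = (long) (4.2 * 1024 * MB);
        System.out.printf("%.1f mb%n", perShardBuffer(total, 11) / (double) MB);
    }
}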

From this analysis, it looks to me like elasticsearch is working fine: it checks the index shard status and recalculates the shard and translog buffers accordingly. If you think this analysis is incorrect or you would like to contribute more information, please leave your comment below.

Sunday, March 1, 2015

Studying the logic of elasticsearch discovery.zen.fd ping_timeout and ping_retries

Today, we are going to study two parameters from elasticsearch version 0.90.7, specifically

  • discovery.zen.fd.ping_timeout

  • discovery.zen.fd.ping_retries


Let's find the definitions in the official documentation, here and here.
There are two fault detection processes running. The first is by the master, to ping all the other nodes in the cluster and verify that they are alive. And on the other end, each node pings to master to verify if its still alive or an election process needs to be initiated.

ping_timeout: how long to wait for a ping response, defaults to 30s.
ping_retries: how many ping failures / timeouts cause a node to be considered failed. Defaults to 3.

So these settings are used by the master node and the data nodes to detect whether a node is okay. Once a ping is sent, the time to wait for a response is 30 seconds by default, and a node is considered down/failed if it fails more than 3 pings in a row. A small configuration sketch follows below.
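Both settings live in elasticsearch.yml; the values here are purely illustrative, not recommendations:

# elasticsearch.yml: zen fault detection tuning (illustrative values)
discovery.zen.fd.ping_timeout: 10s
discovery.zen.fd.ping_retries: 5

Okay, let's go into the code.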

NodesFaultDetection.java

Within the class NodesFaultDetection, there is an inner class SendPingRequest. As it implements the interface Runnable, the run method will be executed by an executor. Objects are written and read to emulate a ping behaviour; you can read the class PingRequest for more information. As you will notice, ping_timeout is passed to the super class of PingRequest.

The essence of the logic is pretty much in the statement transportService.sendRequest(final DiscoveryNode node, final String action, final TransportRequest request, final TransportRequestOptions options, TransportResponseHandler&lt;T&gt; handler). You would think it would be an ICMP ping, but it is not; isReachable() is never called.

In the method handleResponse(PingResponse response), we see that the retry count is reset to 0 and then this SendPingRequest object is scheduled again with the ping_interval you set earlier. In the method handleException(TransportException exp), we see that the variable retryCount is increased by one; if the current retry count exceeds the default 3, the node is considered dead and is removed. If the current retry count is less than the default 3, another ping is sent with the same ping_timeout.

MasterFaultDetection.java

Master fault detection is a little different from nodes fault detection. When the public start method of MasterFaultDetection is called, and then the method innerStart(), a MasterPinger object is created.
this.masterPinger = new MasterPinger();
// start the ping process
threadPool.schedule(pingInterval, ThreadPool.Names.SAME, masterPinger);

So there is a periodic ping with a default interval of 1 second. When the instance of MasterPinger is run, we notice that it goes through the same process of sending the request using the transport service: transportService.sendRequest(final DiscoveryNode node, final String action, final TransportRequest request, final TransportRequestOptions options, TransportResponseHandler&lt;T&gt; handler)
The logic of this request sending is the same as in NodesFaultDetection. What is interesting are the methods overridden in the class BaseTransportResponseHandler. In handleResponse, we see that the retry count is reset back to 0, and then another ping is scheduled.

In the overridden method handleException(TransportException exp), there are three exception checks on the master: whether it is no longer a master, whether we pinged a non-master, or whether we pinged a master that does not have this node registered on it. Past that stage, the retry count is increased by one. If the current retry count is greater than or equal to the default 3, pinging the master has failed and the master is considered failed. If the current retry count is less than the default 3, another ping is sent.
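Putting the two handlers together, the retry logic of both fault detection classes boils down to something like the paraphrased sketch below; this is my own summary with illustrative names, not the verbatim source:

// shared shape of the ping handling in NodesFaultDetection and MasterFaultDetection
void handlePingResponse() {
    retryCount = 0; // healthy again; schedule the next periodic ping
    schedulePing(pingInterval);
}

void handlePingException(TransportException exp) {
    retryCount++;
    if (retryCount >= pingRetries) { // default 3
        notifyFailure(exp); // the node (or master) is now considered failed
    } else {
        sendPing(pingTimeout); // retry immediately with the same ping_timeout
    }
}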

That's it. If you think this analysis needs improvement, please leave your comment below. Thank you.

Friday, February 27, 2015

how to determine currently occupied queue size and cache usage in elasticsearch 0.90

Have you encountered a situation where an elasticsearch client indexing into the elasticsearch cluster gets a rejected exception from the cluster? What if you have cached some filters in your query and you want to know how much memory is used at the moment? If yes, and you are using elasticsearch 0.90, then you have come to the right place. I'm going to show you how to read these statistics through elasticsearch's exposed metrics API. This is important if you want to determine the health of your cluster.

Okay, let's start with the first one: how to get the currently occupied queue size on a node in the cluster.
[jason@node009 ~]$ curl -XGET 'http://localhost:9200/_nodes/node009/stats/thread_pool?pretty'
{
"cluster_name" : "MY_TEST_Cluster",
"nodes" : {
"1111111111111111111111" : {
"timestamp" : 1422372473667,
"name" : "node009",
"transport_address" : "inet[my.private.ip.com/1.2.3.4:9300]",
"hostname" : "node009.foobar.com",
"thread_pool" : {
"generic" : {
"threads" : 1,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 82,
"completed" : 6378594
},
"index" : {
"threads" : 8,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 8,
"completed" : 25735782
},
"get" : {
"threads" : 0,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 0,
"completed" : 0
},
"snapshot" : {
"threads" : 4,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 4,
"completed" : 1003286
},
"merge" : {
"threads" : 4,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 4,
"completed" : 4863710
},
"suggest" : {
"threads" : 0,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 0,
"completed" : 0
},
"bulk" : {
"threads" : 8,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 8,
"completed" : 42148
},
"optimize" : {
"threads" : 0,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 0,
"completed" : 0
},
"warmer" : {
"threads" : 4,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 4,
"completed" : 2087615
},
"flush" : {
"threads" : 3,
"queue" : 0,
"active" : 1,
"rejected" : 0,
"largest" : 4,
"completed" : 10492
},
"search" : {
"threads" : 512,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 512,
"completed" : 245843
},
"percolate" : {
"threads" : 0,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 0,
"completed" : 0
},
"management" : {
"threads" : 5,
"queue" : 0,
"active" : 1,
"rejected" : 0,
"largest" : 5,
"completed" : 2082438
},
"refresh" : {
"threads" : 4,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 4,
"completed" : 1521727
}
}
}
}
}

As can be read above, these are node statistics and only the thread pool stats are shown. If the client is actively indexing into elasticsearch, the metric you should look at is index. The sample above looks pretty good: there is no queue and no rejection.

Next, we will take a look at cache usage, using the same node stats api but changing thread_pool to indices.
[jason@node009 ~]$ curl -XGET 'http://localhost:9200/_nodes/node009/stats/indices?pretty'
{
"cluster_name" : "MY_TEST_Cluster",
"nodes" : {
"1111111111111111111111" : {
"timestamp" : 1422373322128,
"name" : "node009",
"transport_address" : "inet[my.private.ip.com/1.2.3.4:9300]",
"hostname" : "node009.foobar.com",
"indices" : {
"docs" : {
"count" : 134502646,
"deleted" : 104806463
},
"store" : {
"size" : "340.9gb",
"size_in_bytes" : 366092384499,
"throttle_time" : "2.1ms",
"throttle_time_in_millis" : 2
},
"indexing" : {
"index_total" : 25692998,
"index_time" : "6.8h",
"index_time_in_millis" : 24495073,
"index_current" : 22015,
"delete_total" : 13217673,
"delete_time" : "14.6m",
"delete_time_in_millis" : 877101,
"delete_current" : 0
},
"get" : {
"total" : 0,
"get_time" : "0s",
"time_in_millis" : 0,
"exists_total" : 0,
"exists_time" : "0s",
"exists_time_in_millis" : 0,
"missing_total" : 0,
"missing_time" : "0s",
"missing_time_in_millis" : 0,
"current" : 0
},
"search" : {
"open_contexts" : 1,
"query_total" : 204027,
"query_time" : "1.9h",
"query_time_in_millis" : 6856699,
"query_current" : 0,
"fetch_total" : 34409,
"fetch_time" : "2.4m",
"fetch_time_in_millis" : 148210,
"fetch_current" : 0
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 563950,
"total_time" : "7.5h",
"total_time_in_millis" : 27166425,
"total_docs" : 219150610,
"total_size" : "374.6gb",
"total_size_in_bytes" : 402324404903
},
"refresh" : {
"total" : 1522907,
"total_time" : "10.5h",
"total_time_in_millis" : 38110904
},
"flush" : {
"total" : 10499,
"total_time" : "2.4h",
"total_time_in_millis" : 8726951
},
"warmer" : {
"current" : 0,
"total" : 2089352,
"total_time" : "9.3m",
"total_time_in_millis" : 560985
},
"filter_cache" : {
"memory_size" : "2.8gb",
"memory_size_in_bytes" : 3011274608,
"evictions" : 35449
},
"id_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0
},
"fielddata" : {
"memory_size" : "140.5mb",
"memory_size_in_bytes" : 147415629,
"evictions" : 86803
},
"completion" : {
"size" : "231b",
"size_in_bytes" : 231
},
"segments" : {
"count" : 700
}
}
}
}
}

As seen above, there are two cache metrics, filter cache and id cache. Now it is pretty clear how much of these caches is used in this node and how many evictions happened. There is also a fielddata metric; the memory it occupies in the jvm is something you might want to keep an eye on during monitoring. If you want to know exactly which field is using how much memory, you can use this api:
curl localhost:9200/_nodes/stats/indices/fielddata/field1,field2?pretty

But this one is left for you to play with as homework. Hint: replace field1 and field2 with fields you index and read the output. That's it. :-)

Friday, August 1, 2014

CVE-2014-3120 Elastic Search Remote Code Execution

First off, I would like to state that the intention of this article is to protect and safeguard administrators/developers who use an elasticsearch cluster in their production systems to make a living for their families. This post is to raise awareness for those who deploy an elasticsearch cluster, so that you are aware of this vulnerability and protect against malicious attacks. I take no responsibility if you and/or other evil-minded people use this to damage others.

To quickly fix this issue, you should add
script.disable_dynamic: true

to your elasticsearch.yml configuration file and restart the elasticsearch instance.

If you have noticed, disable_dynamic defaults to false in elasticsearch version 1.1.2 and below. However, it defaults to true from 1.2.0 onwards.

Just load this html file CVE-2014-3120 in your browser and then change the field "ES_IP_Address" and the field "File to read/append to". If your es allows access via port 9200, it will show the file content, but if you have blocked the port and disabled dynamic scripting, then you are safe.

If the file content is shown, then you can start to write to it; you need to change the html source to allow writes. When this happens, the attacker will be able to gain access to your box using a public/private key. That's not good!

The said html file is an adaptation of http://www.exploit-db.com/exploits/33370/ and if you are interested in reading more, please read this link.

Monday, December 23, 2013

Elasticsearch index slow log for search and indexing

Today, we are going to learn about logging in elasticsearch for its search and indexing operations. The elasticsearch config file, elasticsearch.yml, should have a configuration section such as below:
################################## Slow Log ##################################

# Shard level query and fetch threshold logging.

#index.search.slowlog.threshold.query.warn: 10s
#index.search.slowlog.threshold.query.info: 5s
#index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms

#index.search.slowlog.threshold.fetch.warn: 1s
#index.search.slowlog.threshold.fetch.info: 800ms
#index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms

#index.indexing.slowlog.threshold.index.warn: 10s
#index.indexing.slowlog.threshold.index.info: 5s
#index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms

So with this example, I have enabled tracing for search query and search fetch with thresholds of 500ms and 200ms respectively. A search in elasticsearch consists of query time and fetch time, hence the two configurations for search. Meanwhile, logging for elasticsearch indexing is also enabled with a threshold of 500ms.

With these configurations set, if your indexing or search exceeds the threshold, an entry will be logged into a file. The log file is located in the path.log directory that is set in elasticsearch.yml.

So what do the numbers really mean? Excerpts from the elasticsearch official documentation:

The logging is done on the shard level scope, meaning the execution of a search request within a specific shard. It does not encompass the whole search request, which can be broadcast to several shards in order to execute. Some of the benefits of shard level logging is the association of the actual execution on the specific machine, compared with request level.

All settings are index level settings (and each index can have different values for it), and can be changed in runtime using the index update settings API.

... and I have tried updating the index setting via a simple tool I made earlier on. But the idea is the same: you just need to call the index update settings API with the setting as a variable. You can find more information here. The keys for the configuration are available in the ShardSlowLogSearchService.java class.
[jason@node1 bin]$ ./indices-setting.sh set search.slowlog.threshold.query.trace 500
{
"ok" : true,
"acknowledged" : true
}

[2013-12-23 12:31:12,758][TRACE][index.search.slowlog.query] [node1] [index_test][146] took[1s], took_millis[1026], types[foo,bar], stats[], search_type[QUERY_THEN_FETCH], total_shards[90], source[{"size":80,"timeout":10000,"query":{"filtered":{"query":{"query_string":{"query":"maxis*","default_operator":"and"}},"filter":{"and":{"filters":[{"query":{"match":{"site":{"query":"www.google.com","type":"boolean"}}}},{"range":{"unixtimestamp":{"from":null,"to":1387825199000,"include_lower":true,"include_upper":true}}}]}}}},"filter":{"query":{"match":{"site":{"query":"www.google.com","type":"boolean"}}}},"sort":[{"unixtimestamp":{"order":"desc"}}]}], extra_source[],

With this example, the query has exceeded the threshold set at 500ms; it ran for 1 second.
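If you would rather skip the shell tool, a minimal sketch of the same update through the 0.90 Java client API is below; the helper name, index name and threshold are illustrative.

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class SlowlogTuner {
    // update the search slow log query trace threshold on one index at runtime
    public static void setQueryTraceThreshold(Client client, String index, String threshold) {
        client.admin().indices().prepareUpdateSettings(index)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.search.slowlog.threshold.query.trace", threshold)
                        .build())
                .execute().actionGet();
    }
}

For example, setQueryTraceThreshold(client, "index_test", "500ms") achieves the same thing as the shell call above.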

As for indexing, the fundamental concept is the same, so we won't elaborate in this article; that is left for you as a tutorial. :-)

Friday, December 20, 2013

Changing ElasticSearch logging level by updating cluster setting.

In this article, we are going to learn how to update logging for all the nodes in the elasticsearch cluster. Logging is crucial to understanding system behaviour, so from time to time you change the logging level in elasticsearch via elasticsearch.yml and restart the elasticsearch instance so that the new logging level is picked up. Unfortunately, a restart on a live production cluster takes some time (because of the shard recovery), and this is not efficient.

Luckily, there is a cluster setting which allows the logging level to be changed on the fly.

So with that, if you want to understand what's happening in a cluster node, you can change the logging level,

e.g.
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "logger.cluster.service" : "DEBUG"
  }
}'

and tail the elasticsearch log; you should see some log lines start appearing. Because logging is managed by the class NodeSettingsService, you should read into the elasticsearch packages that are initialized with this class, for example cluster.service, cluster.routing.allocation.allocator, indices.ttl.IndicesTTLService, etc. Note that the package prefix org.elasticsearch is not needed when the setting is updated.

If you want more information, this link provides further help.