Sunday, May 24, 2015

Learning facets in elasticsearch 0.90

Today we are going to learn facets in Elasticsearch. In this article, we are going to use Elasticsearch 0.90.7 together with this official documentation. Let's get started.

First, we index a few documents for the facet queries later. We are going to create the index articles with type article, and the field we mainly care about is tags.

 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/article?pretty" -d '{"title" : "One",  "tags" : ["foo"]}'  
 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/article?pretty" -d '{"title" : "Two",  "tags" : ["foo", "bar"]}'  
 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/article?pretty" -d '{"title" : "Three", "tags" : ["foo", "bar", "baz"]}'  
 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/article?pretty" -d '{"title" : "Five", "tags" : ["doo", "alpha", "omega"]}'  
 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/article?pretty" -d '{"title" : "Six", "tags" : ["doo", "beep", "ultra"]}'  
 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/article?pretty" -d '{"title" : "Seven", "tags" : ["doo", "boop", "beta"]}'  
 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/article?pretty" -d '{"title" : "Nine", "tags" : ["doo", "gamma", "beep"]}'  

 [user@localhost ~]$ curl -XGET 'http://localhost:9200/articles/_mapping?pretty'  
 {  
  "articles" : {  
   "article" : {  
    "properties" : {  
     "tags" : {  
      "type" : "string"  
     },  
     "title" : {  
      "type" : "string"  
     }  
    }  
   }  
  }  
 }  

Okay, as we can read from the mapping above, both fields are of type string. From the documentation: "The field used for facet calculations must be of type numeric, date/time or be analyzed as a single token — see the Mapping guide for details on the analysis process." Our single-word tags already analyze into single tokens, so the dynamic mapping is fine here. Okay, let's experiment with the different types of facets.
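As a side note before we start: if your tags contained multiple words, you would want them mapped as a single token, as the quote above says. Below is a hedged sketch of creating the index with an explicit not_analyzed mapping up front; it only applies if the articles index does not exist yet, so treat it as an illustration rather than part of this walkthrough.

 [user@localhost ~]$ curl -X PUT "http://localhost:9200/articles?pretty" -d '{ "mappings" : { "article" : { "properties" : { "tags" : { "type" : "string", "index" : "not_analyzed" }, "title" : { "type" : "string" } } } } }'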

 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d ' { "query" : { "query_string" : {"query" : "T*"} }, "facets" : { "tags" : { "terms" : {"field" : "tags"} } } } '  
 {  
  "took" : 90,  
  "timed_out" : false,  
  "_shards" : {  
   "total" : 5,  
   "successful" : 5,  
   "failed" : 0  
  },  
  "hits" : {  
   "total" : 2,  
   "max_score" : 1.0,  
   "hits" : [ {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "76AjyLVST4aRhY0JE2jlAw",  
    "_score" : 1.0, "_source" : {"title" : "Two",  "tags" : ["foo", "bar"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "3f3LNtvOT0GmZ4FNpL4wxA",  
    "_score" : 1.0, "_source" : {"title" : "Three", "tags" : ["foo", "bar", "baz"]}  
   } ]  
  },  
  "facets" : {  
   "tags" : {  
    "_type" : "terms",  
    "missing" : 0,  
    "total" : 5,  
    "other" : 0,  
    "terms" : [ {  
     "term" : "foo",  
     "count" : 2  
    }, {  
     "term" : "bar",  
     "count" : 2  
    }, {  
     "term" : "baz",  
     "count" : 1  
    } ]  
   }  
  }  
 }  

So a query_string query was performed, and the facet output shows the tag counts for the matching documents. If the facet output looks vague, here is what the fields mean:

missing : The number of documents which have no value for the faceted field
total   : The total number of terms in the facet
other   : The number of terms not included in the returned facet (effectively other = total - terms )
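For example, in the match_all query further below that limits the facet to 3 terms, total is 18 while the returned term counts sum to 4 + 3 + 2 = 9, so other is reported as 18 - 9 = 9.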

Another example,

 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d ' { "query" : { "query_string" : {"query" : "S*"} }, "facets" : { "tags" : { "terms" : {"field" : "tags"} } } } '  
 {  
  "took" : 17,  
  "timed_out" : false,  
  "_shards" : {  
   "total" : 5,  
   "successful" : 5,  
   "failed" : 0  
  },  
  "hits" : {  
   "total" : 2,  
   "max_score" : 1.0,  
   "hits" : [ {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "k-Z3lbE9Tx2ZlNDb3ypA8A",  
    "_score" : 1.0, "_source" : {"title" : "Six", "tags" : ["doo", "beep", "ultra"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "JJNPiO3_SPOIiliXEfFnRA",  
    "_score" : 1.0, "_source" : {"title" : "Seven", "tags" : ["doo", "boop", "beta"]}  
   } ]  
  },  
  "facets" : {  
   "tags" : {  
    "_type" : "terms",  
    "missing" : 0,  
    "total" : 6,  
    "other" : 0,  
    "terms" : [ {  
     "term" : "doo",  
     "count" : 2  
    }, {  
     "term" : "ultra",  
     "count" : 1  
    }, {  
     "term" : "boop",  
     "count" : 1  
    }, {  
     "term" : "beta",  
     "count" : 1  
    }, {  
     "term" : "beep",  
     "count" : 1  
    } ]  
   }  
  }  
 }  

Okay, let's try other facet options: a match_all query with a terms facet on the field tags, limiting the facet output to 3 terms.

 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '{ "query" : { "match_all" : { } }, "facets" : { "tag" : { "terms" : { "field" : "tags", "size" : 3 } } } }'  
 {  
  "took" : 8,  
  "timed_out" : false,  
  "_shards" : {  
   "total" : 5,  
   "successful" : 5,  
   "failed" : 0  
  },  
  "hits" : {  
   "total" : 7,  
   "max_score" : 1.0,  
   "hits" : [ {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "WZXN-8BcSDehuM-l1tJE3w",  
    "_score" : 1.0, "_source" : {"title" : "Five", "tags" : ["doo", "alpha", "omega"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "k-Z3lbE9Tx2ZlNDb3ypA8A",  
    "_score" : 1.0, "_source" : {"title" : "Six", "tags" : ["doo", "beep", "ultra"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "JJNPiO3_SPOIiliXEfFnRA",  
    "_score" : 1.0, "_source" : {"title" : "Seven", "tags" : ["doo", "boop", "beta"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "cJFllNNOSYa1SxQLaDSGqA",  
    "_score" : 1.0, "_source" : {"title" : "One",  "tags" : ["foo"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "76AjyLVST4aRhY0JE2jlAw",  
    "_score" : 1.0, "_source" : {"title" : "Two",  "tags" : ["foo", "bar"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "3f3LNtvOT0GmZ4FNpL4wxA",  
    "_score" : 1.0, "_source" : {"title" : "Three", "tags" : ["foo", "bar", "baz"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "HccmhIJOTXqX2XG6uGbuXw",  
    "_score" : 1.0, "_source" : {"title" : "Nine", "tags" : ["doo", "gamma", "beep"]}  
   } ]  
  },  
  "facets" : {  
   "tag" : {  
    "_type" : "terms",  
    "missing" : 0,  
    "total" : 18,  
    "other" : 9,  
    "terms" : [ {  
     "term" : "doo",  
     "count" : 4  
    }, {  
     "term" : "foo",  
     "count" : 3  
    }, {  
     "term" : "bar",  
     "count" : 2  
    } ]  
   }  
  }  
 }  

Now we want the query to show counts for all the terms, using all_terms.

 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '{ "query" : { "match_all" : { } }, "facets" : { "tag" : { "terms" : { "field" : "tags", "all_terms" : true } } } } '  
 {  
  "took" : 3,  
  "timed_out" : false,  
  "_shards" : {  
   "total" : 5,  
   "successful" : 5,  
   "failed" : 0  
  },  
  "hits" : {  
   "total" : 7,  
   "max_score" : 1.0,  
   "hits" : [ {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "WZXN-8BcSDehuM-l1tJE3w",  
    "_score" : 1.0, "_source" : {"title" : "Five", "tags" : ["doo", "alpha", "omega"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "k-Z3lbE9Tx2ZlNDb3ypA8A",  
    "_score" : 1.0, "_source" : {"title" : "Six", "tags" : ["doo", "beep", "ultra"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "JJNPiO3_SPOIiliXEfFnRA",  
    "_score" : 1.0, "_source" : {"title" : "Seven", "tags" : ["doo", "boop", "beta"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "cJFllNNOSYa1SxQLaDSGqA",  
    "_score" : 1.0, "_source" : {"title" : "One",  "tags" : ["foo"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "76AjyLVST4aRhY0JE2jlAw",  
    "_score" : 1.0, "_source" : {"title" : "Two",  "tags" : ["foo", "bar"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "3f3LNtvOT0GmZ4FNpL4wxA",  
    "_score" : 1.0, "_source" : {"title" : "Three", "tags" : ["foo", "bar", "baz"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "HccmhIJOTXqX2XG6uGbuXw",  
    "_score" : 1.0, "_source" : {"title" : "Nine", "tags" : ["doo", "gamma", "beep"]}  
   } ]  
  },  
  "facets" : {  
   "tag" : {  
    "_type" : "terms",  
    "missing" : 0,  
    "total" : 18,  
    "other" : 1,  
    "terms" : [ {  
     "term" : "doo",  
     "count" : 4  
    }, {  
     "term" : "foo",  
     "count" : 3  
    }, {  
     "term" : "beep",  
     "count" : 2  
    }, {  
     "term" : "bar",  
     "count" : 2  
    }, {  
     "term" : "ultra",  
     "count" : 1  
    }, {  
     "term" : "omega",  
     "count" : 1  
    }, {  
     "term" : "gamma",  
     "count" : 1  
    }, {  
     "term" : "boop",  
     "count" : 1  
    }, {  
     "term" : "beta",  
     "count" : 1  
    }, {  
     "term" : "baz",  
     "count" : 1  
    } ]  
   }  
  }  
 }  

How about excluding some terms from the facet output?

 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d ' { "query" : { "match_all" : { } }, "facets" : { "tag" : { "terms" : { "field" : "tags", "exclude" : ["boop", "baz", "beta", "gamma"] } } } }'  
 {  
  "took" : 24,  
  "timed_out" : false,  
  "_shards" : {  
   "total" : 5,  
   "successful" : 5,  
   "failed" : 0  
  },  
  "hits" : {  
   "total" : 7,  
   "max_score" : 1.0,  
   "hits" : [ {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "WZXN-8BcSDehuM-l1tJE3w",  
    "_score" : 1.0, "_source" : {"title" : "Five", "tags" : ["doo", "alpha", "omega"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "k-Z3lbE9Tx2ZlNDb3ypA8A",  
    "_score" : 1.0, "_source" : {"title" : "Six", "tags" : ["doo", "beep", "ultra"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "JJNPiO3_SPOIiliXEfFnRA",  
    "_score" : 1.0, "_source" : {"title" : "Seven", "tags" : ["doo", "boop", "beta"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "cJFllNNOSYa1SxQLaDSGqA",  
    "_score" : 1.0, "_source" : {"title" : "One",  "tags" : ["foo"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "76AjyLVST4aRhY0JE2jlAw",  
    "_score" : 1.0, "_source" : {"title" : "Two",  "tags" : ["foo", "bar"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "3f3LNtvOT0GmZ4FNpL4wxA",  
    "_score" : 1.0, "_source" : {"title" : "Three", "tags" : ["foo", "bar", "baz"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "HccmhIJOTXqX2XG6uGbuXw",  
    "_score" : 1.0, "_source" : {"title" : "Nine", "tags" : ["doo", "gamma", "beep"]}  
   } ]  
  },  
  "facets" : {  
   "tag" : {  
    "_type" : "terms",  
    "missing" : 0,  
    "total" : 18,  
    "other" : 4,  
    "terms" : [ {  
     "term" : "doo",  
     "count" : 4  
    }, {  
     "term" : "foo",  
     "count" : 3  
    }, {  
     "term" : "beep",  
     "count" : 2  
    }, {  
     "term" : "bar",  
     "count" : 2  
    }, {  
     "term" : "ultra",  
     "count" : 1  
    }, {  
     "term" : "omega",  
     "count" : 1  
    }, {  
     "term" : "alpha",  
     "count" : 1  
    } ]  
   }  
  }  
 }  

What about faceting on more than one field? The terms facet also accepts a fields array. Because this example only has one suitable field, the output is the same as faceting on tags alone; try indexing documents with more fields to see the difference.

 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '{ "query" : { "match_all" : { } }, "facets" : { "tag" : { "terms" : { "fields" : ["tags"], "size" : 10 } } } }'  
 {  
  "took" : 6,  
  "timed_out" : false,  
  "_shards" : {  
   "total" : 5,  
   "successful" : 5,  
   "failed" : 0  
  },  
  "hits" : {  
   "total" : 7,  
   "max_score" : 1.0,  
   "hits" : [ {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "WZXN-8BcSDehuM-l1tJE3w",  
    "_score" : 1.0, "_source" : {"title" : "Five", "tags" : ["doo", "alpha", "omega"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "k-Z3lbE9Tx2ZlNDb3ypA8A",  
    "_score" : 1.0, "_source" : {"title" : "Six", "tags" : ["doo", "beep", "ultra"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "JJNPiO3_SPOIiliXEfFnRA",  
    "_score" : 1.0, "_source" : {"title" : "Seven", "tags" : ["doo", "boop", "beta"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "cJFllNNOSYa1SxQLaDSGqA",  
    "_score" : 1.0, "_source" : {"title" : "One",  "tags" : ["foo"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "76AjyLVST4aRhY0JE2jlAw",  
    "_score" : 1.0, "_source" : {"title" : "Two",  "tags" : ["foo", "bar"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "3f3LNtvOT0GmZ4FNpL4wxA",  
    "_score" : 1.0, "_source" : {"title" : "Three", "tags" : ["foo", "bar", "baz"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "HccmhIJOTXqX2XG6uGbuXw",  
    "_score" : 1.0, "_source" : {"title" : "Nine", "tags" : ["doo", "gamma", "beep"]}  
   } ]  
  },  
  "facets" : {  
   "tag" : {  
    "_type" : "terms",  
    "missing" : 0,  
    "total" : 18,  
    "other" : 1,  
    "terms" : [ {  
     "term" : "doo",  
     "count" : 4  
    }, {  
     "term" : "foo",  
     "count" : 3  
    }, {  
     "term" : "beep",  
     "count" : 2  
    }, {  
     "term" : "bar",  
     "count" : 2  
    }, {  
     "term" : "ultra",  
     "count" : 1  
    }, {  
     "term" : "omega",  
     "count" : 1  
    }, {  
     "term" : "gamma",  
     "count" : 1  
    }, {  
     "term" : "boop",  
     "count" : 1  
    }, {  
     "term" : "beta",  
     "count" : 1  
    }, {  
     "term" : "baz",  
     "count" : 1  
    } ]  
   }  
  }  
 }  

What if you just want a count of the documents matching a certain term? A filter facet returns exactly that, a count.

 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d ' { "facets" : { "doo_facet" : { "filter" : { "term" : { "tags" : "doo" } } } } }'  
 {  
  "took" : 3,  
  "timed_out" : false,  
  "_shards" : {  
   "total" : 5,  
   "successful" : 5,  
   "failed" : 0  
  },  
  "hits" : {  
   "total" : 7,  
   "max_score" : 1.0,  
   "hits" : [ {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "WZXN-8BcSDehuM-l1tJE3w",  
    "_score" : 1.0, "_source" : {"title" : "Five", "tags" : ["doo", "alpha", "omega"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "k-Z3lbE9Tx2ZlNDb3ypA8A",  
    "_score" : 1.0, "_source" : {"title" : "Six", "tags" : ["doo", "beep", "ultra"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "JJNPiO3_SPOIiliXEfFnRA",  
    "_score" : 1.0, "_source" : {"title" : "Seven", "tags" : ["doo", "boop", "beta"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "cJFllNNOSYa1SxQLaDSGqA",  
    "_score" : 1.0, "_source" : {"title" : "One",  "tags" : ["foo"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "76AjyLVST4aRhY0JE2jlAw",  
    "_score" : 1.0, "_source" : {"title" : "Two",  "tags" : ["foo", "bar"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "3f3LNtvOT0GmZ4FNpL4wxA",  
    "_score" : 1.0, "_source" : {"title" : "Three", "tags" : ["foo", "bar", "baz"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "HccmhIJOTXqX2XG6uGbuXw",  
    "_score" : 1.0, "_source" : {"title" : "Nine", "tags" : ["doo", "gamma", "beep"]}  
   } ]  
  },  
  "facets" : {  
   "doo_facet" : {  
    "_type" : "filter",  
    "count" : 4  
   }  
  }  
 }  

You can also use a query facet, which gives similar output to the filter facet above.

 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d ' { "facets" : { "foo_facet" : { "query" : { "term" : { "tags" : "foo" } } } } }'  
 {  
  "took" : 2,  
  "timed_out" : false,  
  "_shards" : {  
   "total" : 5,  
   "successful" : 5,  
   "failed" : 0  
  },  
  "hits" : {  
   "total" : 7,  
   "max_score" : 1.0,  
   "hits" : [ {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "WZXN-8BcSDehuM-l1tJE3w",  
    "_score" : 1.0, "_source" : {"title" : "Five", "tags" : ["doo", "alpha", "omega"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "k-Z3lbE9Tx2ZlNDb3ypA8A",  
    "_score" : 1.0, "_source" : {"title" : "Six", "tags" : ["doo", "beep", "ultra"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "JJNPiO3_SPOIiliXEfFnRA",  
    "_score" : 1.0, "_source" : {"title" : "Seven", "tags" : ["doo", "boop", "beta"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "cJFllNNOSYa1SxQLaDSGqA",  
    "_score" : 1.0, "_source" : {"title" : "One",  "tags" : ["foo"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "76AjyLVST4aRhY0JE2jlAw",  
    "_score" : 1.0, "_source" : {"title" : "Two",  "tags" : ["foo", "bar"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "3f3LNtvOT0GmZ4FNpL4wxA",  
    "_score" : 1.0, "_source" : {"title" : "Three", "tags" : ["foo", "bar", "baz"]}  
   }, {  
    "_index" : "articles",  
    "_type" : "article",  
    "_id" : "HccmhIJOTXqX2XG6uGbuXw",  
    "_score" : 1.0, "_source" : {"title" : "Nine", "tags" : ["doo", "gamma", "beep"]}  
   } ]  
  },  
  "facets" : {  
   "foo_facet" : {  
    "_type" : "query",  
    "count" : 3  
   }  
  }  
 }  

To end this article, I leave some homework for you. You should also try the following facets, but do take note of the data types each facet operates on; a hedged sketch using a hypothetical numeric field follows the list.
range
histogram
date histogram
statistical
terms stats
geo distance
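For instance, a histogram facet needs a numeric field, which the articles index in this article does not have. A hedged sketch, assuming a hypothetical numeric field word_count had also been indexed on each article:

 [user@localhost ~]$ curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '{ "query" : { "match_all" : { } }, "facets" : { "wc" : { "histogram" : { "field" : "word_count", "interval" : 100 } } } }'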

In the next article, I will try out the newer replacement for facets, that is, aggregations.

Saturday, May 23, 2015

Learning oauth2 with linkedin using python script

Today, we are going to learn OAuth2 with LinkedIn using Python. It's a simple and easy task. Let's install the necessary library dependencies first. On Debian, you just need to install two packages: python-requests-oauthlib and python-oauthlib.

If you want to be sure the modules are installed, you can start a Python interactive shell, as shown below:

 user@localhost:~$ python  
 Python 2.7.9 (default, Apr 29 2015, 18:34:06)   
 [GCC 4.9.2] on linux2  
 Type "help", "copyright", "credits" or "license" for more information.  
 >>> from requests_oauthlib import OAuth2Session  
 >>> from requests_oauthlib.compliance_fixes import linkedin_compliance_fix  
 >>>   

Looks good, there is no error. If you do not have the apt package manager, you can alternatively install with pip, as follows:

 user@localhost:~$ sudo pip install requests requests_oauthlib  
 Downloading/unpacking requests  
  Downloading requests-2.7.0.tar.gz (451kB): 451kB downloaded  
  Running setup.py egg_info for package requests  
 Downloading/unpacking requests-oauthlib  
  Downloading requests-oauthlib-0.5.0.tar.gz (54kB): 54kB downloaded  
  Running setup.py egg_info for package requests-oauthlib  
 Downloading/unpacking oauthlib>=0.6.2 (from requests-oauthlib)  
  Downloading oauthlib-0.7.2.tar.gz (106kB): 106kB downloaded  
  Running setup.py egg_info for package oauthlib  
 Installing collected packages: requests, requests-oauthlib, oauthlib  
  Running setup.py install for requests  
  Running setup.py install for requests-oauthlib  
  Running setup.py install for oauthlib  
 Successfully installed requests requests-oauthlib oauthlib  
 Cleaning up...  

Okay, so you are set. Now let's create a Python script for the OAuth2 flow.

 #!/usr/bin/python

 # Credentials you get from registering a new application
 client_id = '***********'
 client_secret = '**********'

 # OAuth endpoints given in the LinkedIn API documentation
 authorization_base_url = 'https://www.linkedin.com/uas/oauth2/authorization'
 token_url = 'https://www.linkedin.com/uas/oauth2/accessToken'

 from requests_oauthlib import OAuth2Session
 from requests_oauthlib.compliance_fixes import linkedin_compliance_fix

 linkedin = OAuth2Session(client_id, redirect_uri='http://127.0.0.1')
 linkedin = linkedin_compliance_fix(linkedin)

 # Redirect user to LinkedIn for authorization
 authorization_url, state = linkedin.authorization_url(authorization_base_url)
 print 'Please go here and authorize,', authorization_url

 # Get the authorization verifier code from the callback url
 redirect_response = raw_input('Paste the full redirect URL here:')

 linkedin.fetch_token(token_url, client_secret=client_secret, authorization_response=redirect_response)

 # Fetch a protected resource, i.e. user profile
 r = linkedin.get('https://api.linkedin.com/v1/people/~')
 print r.content

There are basically two variables you need to change: client_id and client_secret. You should be able to find both on your LinkedIn developer application page once you log in. Note that, for the privacy of the credentials, I have redacted them here, but you should be able to locate your own. Note also that you need to provide a redirect URL; as seen here it is http://127.0.0.1, matching what is used in the script.



Now, we are almost ready. If you execute this script in the terminal,

 user@localhost:~$ python linkedin.py  
 Please go here and authorize, https://www.linkedin.com/uas/oauth2/authorization?response_type=code&client_id=**********&redirect_uri=http%3A%2F%2F127.0.0.1&state=*************  
 Paste the full redirect URL here:https://127.0.0.1/?code=******************************************************************************************************************************************&state=***************  
 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>  
 <person>  
  <id>upgggg_g_0</id>  
  <first-name>Dan</first-name>  
  <last-name>Christensen</last-name>  
  <headline>Software Engineer</headline>  
  <site-standard-profile-request>  
   <url>https://www.linkedin.com/profile/view?id=22222222222&amp;authType=name&amp;authToken=abcde;trk=api*************</url>  
  </site-standard-profile-request>  
 </person>  

So you copy the first link from the script output into the browser, where a LinkedIn authorization page will load. Authorize the application, copy the URL you are redirected to, and paste it back into the terminal. Voila, you can now use the token to call the various LinkedIn APIs.
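For the curious, the fetch_token call performs a standard OAuth2 authorization-code exchange against the token_url defined in the script. A rough curl equivalent, assuming you pull the code parameter out of the redirect URL yourself (the values below are placeholders, not real credentials):

 user@localhost:~$ curl -X POST 'https://www.linkedin.com/uas/oauth2/accessToken' -d 'grant_type=authorization_code&code=AUTH_CODE&redirect_uri=http%3A%2F%2F127.0.0.1&client_id=CLIENT_ID&client_secret=CLIENT_SECRET'

The response should contain access_token and expires_in, which is what requests_oauthlib stores for you behind the scenes.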

Friday, May 22, 2015

learning elasticsearch percolator

Today, we are going to learn about the Elasticsearch percolator. But first, what's a percolator? An excerpt from Wikipedia:

A coffee percolator is a type of pot used to brew coffee by continually cycling the boiling or nearly boiling brew through the grounds using gravity until the required strength is reached.
But that's the coffee percolator; for Elasticsearch's percolator:
The percolator allows to register queries against an index, and then send percolate requests which include a doc, and getting back the queries that match on that doc out of the set of registered queries. 
Think of it as the reverse operation of indexing and then searching. Instead of sending docs, indexing them, and then running queries. One sends queries, registers them, and then sends docs and finds out which queries match that doc.

If that sounds a little abstract, let's dip our hands into the water and start experimenting with the Elasticsearch percolator. Start by creating an index.

 [user@localhost ~]$ curl -XPUT 'localhost:9200/test?pretty'
 {
  "ok" : true,
  "acknowledged" : true
 }

Then we register a percolator query.
 [user@localhost ~]$ curl -XPUT 'localhost:9200/_percolator/test/kuku?pretty' -d '{ "query" : { "term" : { "field1" : "value1" } } }'
 {
  "ok" : true,
  "_index" : "_percolator",
  "_type" : "test",
  "_id" : "kuku",
  "_version" : 1
 }

Now we send a document to be percolated, by appending _percolate to the URL. Note that this call does not actually index the document.

 [user@localhost ~]$ curl -XGET 'localhost:9200/test/type1/_percolate?pretty' -d '{ "doc" : { "field1" : "value1" } }'
 {
  "ok" : true,
  "matches" : [ "kuku" ]
 }

So we see the matching registered query, kuku, when we percolate a document whose field1 equals value1. Another way is to percolate at index time, as shown below; if you want the document checked against all registered percolators, use an asterisk.

 [user@localhost ~]$ curl -XPUT 'localhost:9200/test/type1/1?percolate=*&pretty' -d ' { "field1" : "value1" }'
 {
  "ok" : true,
  "_index" : "test",
  "_type" : "type1",
  "_id" : "1",
  "_version" : 2,
  "matches" : [ "kuku" ]
 }

So yes, another match! That's cool! But what if we index while restricting percolation to percolators tagged with color green?

 [user@localhost ~]$ curl -XPUT 'localhost:9200/test/type1/1?percolate=color:green&pretty' -d '{ "field1" : "value1", "field2" : "value2" }'
 {
  "ok" : true,
  "_index" : "test",
  "_type" : "type1",
  "_id" : "1",
  "_version" : 3,
  "matches" : [ ]
 }

There is no match, because percolate=color:green only runs registered queries whose own percolator document has a field color with value green, and the kuku percolator we registered carries no such field.
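If we wanted the color filter to match something, we could register a percolator that carries that metadata. A hedged sketch, where green-kuku is just a made-up name:

 [user@localhost ~]$ curl -XPUT 'localhost:9200/_percolator/test/green-kuku?pretty' -d '{ "color" : "green", "query" : { "term" : { "field1" : "value1" } } }'

With a percolator like that registered, repeating the percolate=color:green request above should list green-kuku under matches. Now let's index entirely different content, to see whether we still match the percolator we set up before.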

 [user@localhost ~]$ curl -XPUT 'localhost:9200/test/type1/1?percolate=*&pretty' -d '{ "field1" : "value33", "field2" : "value2" }'
 {
  "ok" : true,
  "_index" : "test",
  "_type" : "type1",
  "_id" : "1",
  "_version" : 7,
  "matches" : [ ]
 }

So there is no match. This is pretty cool: you can pre-register a few percolator queries, and when an interesting document comes in, any registered queries that match it are shown in the output.

Instead of querying indexed data, with the percolator you get the matching queries at indexing time. Something very cool.

Sunday, May 10, 2015

My journey and experience on upgrading apache cassandra from version 1.0.12 to 1.1.12

If you have read my previous post on Apache Cassandra upgrades, this is another journey, a major upgrade of Apache Cassandra from version 1.0 to 1.1. In this article, I will share my experience upgrading Cassandra from version 1.0.12 to 1.1.12.

The sstable version used by Cassandra 1.0.12 is hd, and you should ensure that the sstables on all nodes are at version hd before proceeding with the upgrade to a newer version of Cassandra.

First, let's read some highlights of Cassandra 1.1:

  • api version 19.33.0

  • new file cassandra-rackdc.properties, commitlog_archiving.properties

  • new directory structure for sstable and filename change for sstable.

  • more features/improvements to nodetool, such as compactionstats showing remaining time, exact size calculation for cleanup operations, the ability to stop compactions, rangekeysample, getsstables, repair progress reporting, etc.

  • global key and row cache.

  • cql 3.0 beta

  • schema change for cassandra in caching.

  • libthrift version 0.7.0.

  • sstable hf version.

  • the default compressor becomes the snappy compressor.

  • a lot of improvements to the leveled compaction strategy.

  • sliced_buffer_size_in_kb option has been removed from the cassandra.yaml configuration file (this option was a no-op since 1.0).

  • thread stack size increased to 160k

  • added flag UseTLAB for jvm to improve read speed.

As this is a newer version of Cassandra compared to the previous one, it is always good to set up a test node so you can play around and get familiar with it before actually doing the upgrade. With this new node you can also quickly test the application clients that write to and/or read from the test Cassandra node. It is also recommended to do some load testing to see whether the results are what you expect.

If you want to be extremely careful about the upgrade, then reading the code changes between the versions you chose to upgrade is always recommended. This is the link for this upgrade, and since there are huge differences between the versions, you should split the reading into pieces as small as possible. You can learn a lot from experienced coders if you spend time reading their code, and you can pick up new technology too. It is a dauntingly huge task, but if you are willing to spend the time, the returns are too great to even describe here.

If you upgrade from 1.0.12 to 1.1.12, Cassandra 1.1 is smart enough to move the sstables into the new directory structure, which eases your job: you do not need to move the sstables yourself. When the new Cassandra 1.1.12 starts up, it moves them for you.

You might want to consider preparing the configuration files for your cluster environment beforehand, for example cassandra.yaml, cassandra-env.sh and cassandra.in.sh. By doing this you shorten the upgrade window and reduce errors, since you are not editing files during the upgrade; an upgrade script can simply symlink them for you. So spend some time writing upgrade and downgrade scripts for the production cluster, and test them.

Because the upgrade process takes time (a long time, depending on how many nodes you have in the cluster) and will tire you out (remember, there will be post-upgrade issues to deal with), I suggest you create an upgrade script to handle the process. The Cassandra configuration you prepared earlier is then symlinked automatically by this script. Doing this reduces risk factors such as human error, and on a production cluster you want to cut the risk to a minimum.
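To make this concrete, below is a minimal sketch of what such a per-node upgrade script could look like. It assumes a tarball installation under /opt, an init script named cassandra, and configuration prepared in a separate directory; every path, name and value here is an assumption to adapt to your own cluster.

 #!/bin/bash
 # per-node upgrade sketch -- adapt paths, service names and versions to your cluster
 set -e

 NEW_VERSION=1.1.12
 CASSANDRA_HOME=/opt/cassandra
 CONF_REPO=/opt/cassandra-conf/$NEW_VERSION    # prepared cassandra.yaml, cassandra-env.sh, cassandra.in.sh

 nodetool snapshot                             # snapshot before touching anything
 nodetool drain                                # flush memtables and stop accepting writes
 /etc/init.d/cassandra stop                    # stop the old cassandra if it is still running

 tar -xzf apache-cassandra-$NEW_VERSION-bin.tar.gz -C /opt
 ln -sfn /opt/apache-cassandra-$NEW_VERSION "$CASSANDRA_HOME"

 # symlink the configuration prepared beforehand
 ln -sf "$CONF_REPO/cassandra.yaml"   "$CASSANDRA_HOME/conf/cassandra.yaml"
 ln -sf "$CONF_REPO/cassandra-env.sh" "$CASSANDRA_HOME/conf/cassandra-env.sh"
 ln -sf "$CONF_REPO/cassandra.in.sh"  "$CASSANDRA_HOME/bin/cassandra.in.sh"

 /etc/init.d/cassandra start                   # start cassandra 1.1.12
 tail -n 200 /var/log/cassandra/system.log     # watch for errors and the sstable migration messages

A matching downgrade script that points the symlinks back at the old version is worth writing and testing at the same time.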

There is official upgrade documentation here at DataStax, but because your cluster environment might be different, you might want to write your own upgrade steps, taking the official documentation into consideration, and have them peer reviewed so you cover as much as possible. Best of all is if your peers test the steps and raise questions you might not have thought of.

If you are using a monitoring system such as OpsCenter, SPM, jconsole, or your own monitoring system, you want to check whether it supports the newer version of Cassandra.

The per-column-family key cache and row cache have been replaced with a global key cache and a global row cache respectively. These global cache settings can be found in the cassandra.yaml file; if you leave them at the defaults, you get one million key cache entries. Below are some new parameters for cassandra 1.1 (a sketch of the relevant cassandra.yaml section follows the two lists),

  • populate_io_cache_on_flush

  • key_cache_size_in_mb

  • key_cache_save_period

  • row_cache_size_in_mb

  • row_cache_save_period

  • row_cache_provider

  • commitlog_segment_size_in_mb

  • trickle_fsync

  • trickle_fsync_interval_in_kb

  • internode_authenticator

and below are the configuration options that were removed

  • sliced_buffer_size_in_kb

  • thrift_max_message_length_in_mb
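As a rough illustration, the global cache section of a 1.1 cassandra.yaml looks something like the snippet below. The values shown are what I recall as the 1.1 defaults, so treat them as assumptions and verify against the cassandra.yaml shipped with your tarball.

 # global caches in cassandra.yaml (1.1), replacing the per column family cache settings
 key_cache_size_in_mb:                 # empty means auto: min(5% of heap in MB, 100MB)
 key_cache_save_period: 14400          # seconds between key cache saves to disk
 row_cache_size_in_mb: 0               # row cache disabled by default
 row_cache_save_period: 0
 row_cache_provider: SerializingCacheProvider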

For the upgrade in production, these are the steps that were taken:

pre-upgrade, applied to all nodes in the cluster:
* stop any repair or cleanup on all Cassandra nodes and make sure no streaming is happening. Streaming happens when nodes bootstrap or when you rebuild a node.

upgrade steps, per node:
1. download cassandra 1.1.12 and verify the binary is not corrupted.
2. extract the compressed tarball.
3. nodetool snapshot.
4. nodetool drain.
5. stop cassandra if it has not already stopped.
6. symlink the new configuration files.
7. start cassandra 1.1.12.
8. monitor cassandra system.log.
9. check the monitoring system.

If everything looks okay on the first node, it is best to do a second node, and then continue with the rest of the nodes in a rolling upgrade fashion. After you migrate, you might also notice there are 3 additional column families in the cassandra 1.1 system keyspace.

cassandra 1.0 system keyspace has a total of 7 column families

  • HintsColumnFamily

  • IndexInfo

  • LocationInfo

  • Migrations

  • NodeIdInfo

  • Schema

  • Versions

cassandra 1.1 system keyspace has a total of 10 column families.

  • HintsColumnFamily

  • IndexInfo

  • LocationInfo

  • Migrations

  • NodeIdInfo

  • Schema

  • schema_columnfamilies

  • schema_columns

  • schema_keyspaces

  • Versions

If you are using the leveled compaction strategy, the sstables need to be scrubbed accordingly; there are nodetool scrub and the offline sstablescrub for this job. If you have defined column families using counter types, you should upgrade their sstables using nodetool upgradesstables. A couple of example invocations follow.
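Hedged examples of those commands; the keyspace and column family names are placeholders, not anything from this cluster.

 [user@localhost ~]$ nodetool scrub my_keyspace my_cf                  # online scrub on a running node
 [user@localhost ~]$ sstablescrub my_keyspace my_cf                    # offline scrub, run while cassandra is stopped
 [user@localhost ~]$ nodetool upgradesstables my_keyspace counter_cf   # rewrite sstables, e.g. for counter column families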

That's it. If you need professional services for this, please contact me and I will gladly provide professional advice and/or services.

Saturday, May 9, 2015

Light walkthrough on Java Execution Time Measurement Library (JETM)

Today, let's learn about a Java library, the Java Execution Time Measurement Library, or JETM. What is JETM?

From the official site
A small and free library, that helps locating performance problems in existing Java applications.

 

JETM enables developers to track down performance issues on demand, either programmatic or declarative with minimal impact on application performance, even in production.

JETM is pretty cool and has a lot of features.

You can follow the tutorial trail here. The following code is taken from one of the tutorials, with minor modification.
import etm.core.configuration.BasicEtmConfigurator;
import etm.core.configuration.EtmManager;
import etm.core.monitor.EtmMonitor;
import etm.core.monitor.EtmPoint;
import etm.core.renderer.SimpleTextRenderer;

public class BusinessService {

    // a single shared monitor for the whole application
    private static final EtmMonitor etmMonitor = EtmManager.getEtmMonitor();

    public void someMethod() {
        // create a measurement point for this method invocation
        EtmPoint point = etmMonitor.createPoint("BusinessService:someMethod");

        try {
            Thread.sleep((long) (10d * Math.random()));
            nestedMethod();
        } catch (InterruptedException e) {

        } finally {
            // always collect the point, even if the method throws
            point.collect();
        }
    }

    public void nestedMethod() {
        EtmPoint point = etmMonitor.createPoint("BusinessService:nestedMethod");

        try {
            Thread.sleep((long) (15d * Math.random()));
        } catch (InterruptedException e) {

        } finally {
            point.collect();
        }
    }

    public static void main(String[] args) {
        BasicEtmConfigurator.configure(true); // true enables nested rendering of measurement points
        etmMonitor.start();

        BusinessService bizz = new BusinessService();
        bizz.someMethod();
        bizz.someMethod();
        bizz.someMethod();
        bizz.someMethod();
        bizz.nestedMethod();

        etmMonitor.render(new SimpleTextRenderer());
        etmMonitor.stop();
    }
}

Hit the run button in Eclipse.
EtmMonitor info [INFO] JETM 1.2.3 started.
|------------------------------|---|---------|--------|--------|--------|
| Measurement Point            | # | Average | Min    | Max    | Total  |
|------------------------------|---|---------|--------|--------|--------|
| BusinessService:nestedMethod | 1 |   4.121 |  4.121 |  4.121 |  4.121 |
|------------------------------|---|---------|--------|--------|--------|
| BusinessService:someMethod   | 4 |  12.611 |  6.196 | 16.347 | 50.442 |
| BusinessService:nestedMethod | 4 |   5.381 |  0.017 | 10.194 | 21.523 |
|------------------------------|---|---------|--------|--------|--------|
EtmMonitor info [INFO] Shutting down JETM.

So we see that nestedMethod was executed once on its own and someMethod four times, each of which also called nestedMethod. The result shows the minimum, maximum and average execution time for each measurement point, and the last column shows the total. Pretty neat for a small Java library.

 

Friday, May 8, 2015

Elasticsearch no node exception happened in tomcat web container

If you ever get a stack trace like the one below in your web container log file and wonder how to solve it, then read on. But first, a little background: an Elasticsearch 0.90 cluster, with a client running inside a Tomcat web container using the Elasticsearch Java transport client. Both server and client run the same Elasticsearch version and the same Java version.
16.Feb 6:21:30,830 ERROR WebAppTransportClient [put]: error
org.elasticsearch.client.transport.NoNodeAvailableException: No node available
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:212)
at org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106)
at org.elasticsearch.client.support.AbstractClient.index(AbstractClient.java:84)
at org.elasticsearch.client.transport.TransportClient.index(TransportClient.java:316)
at org.elasticsearch.action.index.IndexRequestBuilder.doExecute(IndexRequestBuilder.java:324)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
at com.example.elasticsearch.WebAppTransportClient.put(WebAppTransportClient.java:258)
at com.example.elasticsearch.WebAppTransportClient.put(WebAppTransportClient.java:307)
at com.example.threadpool.TaskThread.run(TaskThread.java:38)
at java.lang.Thread.run(Thread.java:662)

This exception disappears once the web container is restarted, but restarting the webapp that often is not a good solution in production. I did some research online and gathered the following information:

* The default number of channels in each of these class are configured with the configuration prefix of transport.connections_per_node.
https://www.found.no/foundation/elasticsearch-networking/

* If you see NoNodeAvailableException you may have hit a connect timeout of the client. Connect timeout is 30 secs IIRC.
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/VyNpCs17aTA/CcXkYvVMYWAJ

* You can set org.elasticsearch.client.transport to TRACE level in your logging configuration (on the client side) to see the failures it has (to connect for example). For more information, you can turn on logging on org.elasticsearch.client.transport.
https://groups.google.com/forum/#!topic/elasticsearch/Mt2x4d5BCGI

* This means that you started to get disconnections between the client (transport) and the server. It will try and reconnect automatically, and possibly manages to do it. For more information, you can turn on logging on org.elasticsearch.client.transport.
* Can you try and increase the timeout and see how it goes? Set client.transport.ping_timeout in the settings you pass to the TransportClient to 10s for example.
* We had the same problem. reason: The application server uses a older version of log4j than ES needed.
http://elasticsearch-users.115913.n3.nabble.com/No-node-available-Exception-td3920119.html

* The correct method is to add the known host addresses with addTransportAddresses() and afterwards check the connectedNodes() method. If it returns empty list, no nodes could be found.
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/ceH3UIy14jM/XJSFKd8kAXEJ

* the most common case for NoNodeAvailable is the regular pinging that the transport client does fails to do it, so no nodes end up as the list of nodes that the transport client uses. If you will set client.transport (or org.elasticsearch.client.transport if running embedded) to TRACE, you will see the pinging effort and if it failed or not (and the reason for the failures). This might get us further into trying to understand why it happens.
* .put("client.transport.ping_timeout", pingTimeout)
* .put("client.transport.nodes_sampler_interval", pingSamplerInterval).build();
https://groups.google.com/forum/#!msg/elasticsearch/9aSkB0AVrHU/_4kDkjAFKuYJ

* this has nothing to do with migration errors. Your JVM performs a very long GC of 9 seconds which exceeds the default ping timeout of 5 seconds, so ES dropped the connection ,assuming your JVM is just too busy. Try again if you can reproduce it. If yes, increase the timeout to something like 10 seconds, or consider to update your Java version.
http://elasticsearch-users.115913.n3.nabble.com/Migration-errors-0-20-1-to-0-90-td4035165.html

* During long GC the JVM is somehow suspended. So your client can not see it anymore.
http://grokbase.com/t/gg/elasticsearch/136fw0hppp/transport-client-ping-timeout-no-node-available-exception

* You wrote that you have a 0.90.9 cluster but you added 0.90.0 jars to the client. Is that correct?
* Please check:
*
* if your cluster nodes and client node is using exactly the same JVM
* if your cluster and client use exactly the same ES version
* if your cluster and client use the same cluster name
* reasons outside ES: IP blocking, network reachability, network interfaces, IPv4/IPv6 etc.
* Then you should be able to connect with TransportClient.

https://groups.google.com/forum/#!msg/elasticsearch/fYmKjGywe8o/z9Ci5L5WjUAJ

So I tried all the options mentioned above, and the problem was solved by enabling sniffing in the transport client settings (client.transport.sniff set to true). For more information, read here.

I hope this will solve your problem too.