For goodness of fit, you can also just do a binned chi-square test in most situations.
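To make the suggestion concrete, here is a minimal sketch of a binned chi-square statistic in plain Python. The function name, the bin edges, and the uniform CDF used in the example are all assumptions made for illustration; nothing here comes from a particular library.

```python
from bisect import bisect_left

def binned_chi_square(samples, cdf, edges):
    """Return (statistic, degrees_of_freedom) for a binned chi-square test.

    samples: observed draws
    cdf:     CDF of the hypothesized distribution (assumed fully specified)
    edges:   increasing interior bin boundaries; the bins are
             (-inf, e0], (e0, e1], ..., (e_last, +inf)
    """
    n = len(samples)
    # Observed count per bin.
    counts = [0] * (len(edges) + 1)
    for x in samples:
        counts[bisect_left(edges, x)] += 1
    # Expected count per bin from differences of the hypothesized CDF.
    cum = [0.0] + [cdf(e) for e in edges] + [1.0]
    expected = [n * (hi - lo) for lo, hi in zip(cum, cum[1:])]
    stat = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    # k - 1 degrees of freedom when no parameters are fit from the data.
    return stat, len(counts) - 1

# Hypothetical example: test draws against Uniform(0, 1) with four equal bins.
uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
edges = [0.25, 0.5, 0.75]
balanced = [0.125, 0.375, 0.625, 0.875] * 10
lopsided = [0.1] * 40
balanced_stat, dof = binned_chi_square(balanced, uniform_cdf, edges)
lopsided_stat, _ = binned_chi_square(lopsided, uniform_cdf, edges)
```

The statistic is then compared with a chi-square distribution on the returned degrees of freedom (for 3 degrees of freedom the 5% critical value is about 7.815). The usual caveats apply: the bins should be wide enough that expected counts are not tiny, and the degrees of freedom drop further if parameters were estimated from the same data.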

LingPipe Cookbook. Available on Amazon too.

What would you recommend I do? Could you point me to resources and other links for performing such NLP tasks? Thank you very much for your excellent tutorials and support.

Even though this seems like a non-issue for statisticians, many non-statisticians are genuinely confused because they assume that a density function is just another static function, as in ordinary calculus.

I am quite comfortable using p everywhere and relying on the variables inside the parentheses to indicate which density function is meant.

However, we often find that some densities in our application are significant enough to justify giving them their own names like f, g or h. Another situation where we genuinely need specific names for density functions is when a single random variable has multiple possible densities depending on the context.

This practice, however, leads many to think that f, g or h are just ordinary calculus functions, which is not true: density functions can be marginalized to reduce their dimension, or conditioned on another random variable, while ordinary functions cannot. For example, say f(a | b) is the conditional density of a given b. The shape or location of f then depends on b, which is also a random variable, so f has no static shape or location. What we really want to communicate with f is a law of dependence between a and b, and this law can be static, which is why we are motivated to use f instead of the generic notation p in the first place.

For statisticians, it is quite natural to also write that integrating b out of g(a,b) gives g(a), that is, to use g for both the joint and the marginal density. This is justifiable because the joint density implies its marginal densities. Sometimes people also write h(a,b) = f(a)g(b | a), which looks utterly confusing to most outsiders. However, this notation immediately makes sense once we realize that replacing f, g and h with p gives the familiar formula for the joint density of a and b.
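A small discrete analogue may help make the point, with sums standing in for the integrals. The joint table below is made up purely for illustration, and the helper names are hypothetical. The sketch marginalizes b out of p(a, b), conditions on a, and checks that p(a) p(b | a) rebuilds the joint.

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(a, b) over two binary variables.
joint = {
    (0, 0): F(1, 8), (0, 1): F(3, 8),
    (1, 0): F(1, 4), (1, 1): F(1, 4),
}

def marginal_a(joint):
    """Sum b out of p(a, b) to obtain p(a)."""
    out = {}
    for (a, _b), pr in joint.items():
        out[a] = out.get(a, 0) + pr
    return out

def conditional_b_given_a(joint):
    """p(b | a) = p(a, b) / p(a), keyed as (b, a)."""
    p_a = marginal_a(joint)
    return {(b, a): pr / p_a[a] for (a, b), pr in joint.items()}

p_a = marginal_a(joint)
p_b_given_a = conditional_b_given_a(joint)

# p(a, b) = p(a) p(b | a): the product recovers the joint table exactly.
rebuilt = {(a, b): p_a[a] * p_b_given_a[(b, a)] for (a, b) in joint}
```

All three tables could be written with the single letter p, because the arguments already say which one is meant. Giving them separate names such as h, f and g changes nothing about how they are related.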

In conclusion, I think the confusion is genuine and widespread among researchers. However, as I mentioned, it has legitimate causes, and no simple solution is available yet. My current approach is to keep reminding myself that f, g or h are nothing but specific variants of p, and hence are not just calculus functions.

I wrote everything up in a much more statistically sophisticated way, and posted it as a knitr document with examples running in Stan:

https://github.com/stan-dev/example-models/tree/master/knitr/pool-repeated-trials

I could mail the knitted HTML, but knitr’s very inefficient with the images and it’s something like 5MB.

Also, why did you want only one sample, as given by the first argument in this line:

sampledAvg = rbeta(1,1+hits,1+atBats-hits)
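For reference, the first argument of R's rbeta is the number of draws, so the line above produces a single value from Beta(1 + hits, 1 + atBats - hits). A rough Python analogue, as a sketch only, is shown here. The function name, the hit and at-bat counts, and the seed are hypothetical and not taken from the original post.

```python
import random

def sample_posterior_averages(hits, at_bats, n_draws=1, seed=None):
    """Draw n_draws values from Beta(1 + hits, 1 + at_bats - hits)."""
    rng = random.Random(seed)
    return [rng.betavariate(1 + hits, 1 + at_bats - hits)
            for _ in range(n_draws)]

# One draw, mirroring rbeta(1, ...), versus many draws from the same Beta.
one = sample_posterior_averages(hits=30, at_bats=100, n_draws=1, seed=0)
many = sample_posterior_averages(hits=30, at_bats=100, n_draws=1000, seed=0)
mean_of_many = sum(many) / len(many)
```

A single draw is one plausible value of the average. A larger number of draws gives a picture of the whole distribution, whose mean is (1 + hits) / (2 + atBats).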

[java] Exception in thread "main" 410:

[java] {"errors":[{"message":"The Twitter REST API v1 is no longer active. Please migrate to API v1.1. https://dev.twitter.com/docs/api/1.1/overview.","code":64}]}

[java] TwitterException{exceptionCode=[f3acd3ed-00597cc4], statusCode=410, retryAfter=0, rateLimitStatus=null, version=2.1.7-SNAPSHOT(build: edd6ee1c26db6d1ede32dc6940b5acb4c48e0d96)}

This blog is a godsend, even in 2015. I’m working on fuzzy matching data from two data sources in SQL and trying to get a deeper understanding of Jaro-Winkler.