As I accumulate more experience in coding and in life, one observation stands out to me: whenever there is a problem to be solved, there usually exists a very intuitive solution to it. Sometimes this solution happens to be efficient, sometimes grossly suboptimal. This is especially noticeable in algorithms, because an algorithm is often a conceptualization of a real-life problem, and the algorithm keeps the essential elements of that problem, allowing us to tackle it directly.

The largest rectangle in a matrix is one such problem, in my opinion. …

When I first started learning to use the Kalman filter, it seemed very complicated to me, mainly because it takes so many inputs and you never really know whether you have set them correctly. By going through this example, I hope my readers will see that the motivations and logic behind it are really very pragmatic in nature, and not so complicated after all.

So let’s get started with our example.

I am dropping a tennis ball from a helicopter (or maybe my airship), and I would like to predict its position in the air until it hits the ground.
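To make the setup concrete, here is a minimal sketch of one Kalman filter step for this falling-ball scenario, tracking height and velocity with gravity as a known control input. All of the parameters (time step, noise levels, matrix names) are illustrative assumptions of mine, not values from the story:

```python
import numpy as np

# One predict/update cycle of a Kalman filter for a falling ball.
# State x = [height, velocity]; z is a noisy height measurement.
# dt, q, r are assumed values for illustration only.
def kalman_step(x, P, z, dt=0.1, g=-9.81, q=0.01, r=1.0):
    F = np.array([[1.0, dt], [0.0, 1.0]])  # state transition model
    B = np.array([0.5 * dt**2, dt])        # effect of gravity on the state
    H = np.array([[1.0, 0.0]])             # we only observe the height
    Q = q * np.eye(2)                      # process noise covariance
    R = np.array([[r]])                    # measurement noise covariance

    # Predict: propagate the state and its uncertainty forward.
    x = F @ x + B * g
    P = F @ P @ F.T + Q

    # Update: blend in the measurement, weighted by the Kalman gain.
    y = z - H @ x                          # innovation (measurement residual)
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + (K @ y).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P
```

Calling `kalman_step` in a loop with each new height reading keeps a running estimate of both height and velocity, even though only height is measured.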

Tarjan’s Off-line Lowest Common Ancestor Algorithm is an interesting application of the disjoint set structure: it optimizes the performance of determining the lowest common ancestor (LCA) of two nodes within a tree, and it also involves concepts such as caching and recursion. In preparation for this story, please refer to my previous story on the Lowest Common Ancestor for context.

Ok, let’s get started!

In order to understand this algorithm, we first have to have a good grasp on the concept of disjoint sets…

*Disjoint sets are collections of nodes, where there is no overlap between sets.*

Sounds simple, right? Actually, it is; it’s just that enforcing this rule as you build up your disjoint sets requires a bit of artistry. …
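A minimal sketch of that artistry is the classic union-find structure, with path compression and union by rank to keep lookups fast. The class and method names here are my own, not necessarily the story’s:

```python
# Union-find (disjoint set) sketch: find() returns a set's
# representative; union() merges two sets while keeping them disjoint.
class DisjointSet:
    def __init__(self):
        self.parent = {}
        self.rank = {}

    def make_set(self, x):
        self.parent.setdefault(x, x)
        self.rank.setdefault(x, 0)

    def find(self, x):
        # Path compression: point every node on the path at the root.
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
```

Two nodes are in the same set exactly when `find` returns the same representative for both, which is how the no-overlap rule is enforced.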

It is the lowest-level ancestor shared by the two nodes within a tree.

Let’s take a look at an example:
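As a preview of where the disjoint sets lead, here is a hedged sketch of Tarjan’s off-line algorithm on a small illustrative tree of my own; the tree shape, names, and query format are assumptions for illustration:

```python
from collections import defaultdict

# Tarjan's off-line LCA sketch: one DFS answers all queries at once.
# tree maps each node to its children; queries are (u, v) pairs.
def tarjan_lca(tree, root, queries):
    parent = {}    # union-find parent pointers
    ancestor = {}  # cached answer for each set's current subtree
    visited = set()
    answers = {}
    pending = defaultdict(list)
    for u, v in queries:
        pending[u].append(v)
        pending[v].append(u)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def dfs(u):
        parent[u] = u
        ancestor[u] = u
        for child in tree.get(u, []):
            dfs(child)
            parent[find(child)] = u  # merge child's set into u's set
            ancestor[find(u)] = u    # cache: this merged set answers to u
        visited.add(u)
        for other in pending[u]:
            if other in visited:
                # Both endpoints seen: the cached ancestor is the LCA.
                answers[frozenset((u, other))] = ancestor[find(other)]

    dfs(root)
    return answers
```

Answers come back keyed by the unordered pair of query nodes, since the LCA does not depend on query order.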

The normal distribution, also called the Gaussian distribution, is one of the most widely encountered distributions. One of the main reasons is that the normalized sum of independent random variables tends toward a normal distribution, regardless of the distribution of the individual variables (for example, you can add a bunch of random samples that take on only the values -1 and 1, yet the sum itself becomes normally distributed as the number of samples grows larger). This is known as the central limit theorem. …
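That ±1 example is easy to check numerically. A quick sketch (sample sizes and seed are arbitrary choices of mine): each ±1 variable has mean 0 and variance 1, so dividing the sum of n of them by √n should yield something close to a standard normal.

```python
import numpy as np

# Central limit theorem demo with variables that only take values -1 and 1.
rng = np.random.default_rng(0)
n, trials = 500, 5000
samples = rng.choice([-1, 1], size=(trials, n))

# Each variable has variance 1, so sum / sqrt(n) should be ~N(0, 1).
sums = samples.sum(axis=1) / np.sqrt(n)
print(sums.mean(), sums.std())  # both should be close to 0 and 1
```

A histogram of `sums` would show the familiar bell curve, even though the underlying variables are as far from bell-shaped as possible.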

A classic formulation of the problem would be:

If I were given a graph of nodes, where each node is connected to several other nodes by connections of varying distance, what is the shortest path to every node in the graph if I start from one particular node?

In order to implement Dijkstra’s algorithm we first need to define a node and an edge:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.edges = set()
        self.distance = -1

    @staticmethod
    def add_edge(a, b, dist):
        a.edges.add((b, dist))
        b.edges.add((a, dist))
```

In this case, I have defined a node that contains a value (basically an id) and a distance initialized to -1 (representing infinity). Additionally, the Node contains a set of edges, each of which is a tuple of the target node the edge connects to (from self) and the distance to that target node. …
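Given that structure, a priority-queue version of Dijkstra’s algorithm might look like the following sketch. The Node class is restated so the snippet stands alone, but the traversal code is my own illustration, not necessarily the story’s implementation:

```python
import heapq

# Node as defined above: a value, a set of (neighbor, distance) edges,
# and a distance that starts at -1 (meaning "infinity" / unreached).
class Node:
    def __init__(self, value):
        self.value = value
        self.edges = set()
        self.distance = -1

    @staticmethod
    def add_edge(a, b, dist):
        a.edges.add((b, dist))
        b.edges.add((a, dist))


def dijkstra(start):
    # Classic Dijkstra: repeatedly settle the closest unvisited node.
    # id() breaks ties in the heap so Node objects are never compared.
    start.distance = 0
    heap = [(0, id(start), start)]
    visited = set()
    while heap:
        dist, _, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry for an already-settled node
        visited.add(node)
        for neighbor, edge_dist in node.edges:
            new_dist = dist + edge_dist
            if neighbor.distance < 0 or new_dist < neighbor.distance:
                neighbor.distance = new_dist
                heapq.heappush(heap, (new_dist, id(neighbor), neighbor))
```

For example, with edges a-b of length 1, b-c of length 2, and a-c of length 5, running `dijkstra(a)` leaves `b.distance` at 1 and `c.distance` at 3, since the two-hop route beats the direct edge.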

In portfolio performance analysis, the Sharpe ratio is usually the first number that people look at. However, it does not tell us the whole story (nothing does…). So, let’s spend some time looking at a few more metrics that can be very helpful at times.

The Sharpe ratio is the annualized ratio of the average return to the standard deviation of returns. We had an introduction to it in a previous story.

Let’s take a look at it again with a test price time series.

```python
import pandas as pd
import numpy as np
from pandas.tseries.offsets import BDay


def daily_returns(prices):
    res = (prices / prices.shift(1) - 1.0)[1:]
    res.columns = ['return']
    return res


def sharpe(returns, risk_free=0):
    adj_returns = returns - risk_free
    return (np.nanmean(adj_returns) * np.sqrt(252)) \
        / np.nanstd(adj_returns, ddof=1)


def test_price1():
    # Steadily rising price series.
    start_date = pd.Timestamp(2020, 1, 1) + BDay()
    n = 100
    bdates = [start_date + BDay(i) for i in range(n)]
    price = [10.0 + i / 10.0 for i in range(n)]
    return pd.DataFrame(data={'date': bdates,
                              'price1': price}).set_index('date')


def test_price2():
    # Rising price series with a flat stretch in the middle.
    start_date = pd.Timestamp(2020, 1, 1) + BDay()
    n = 100
    bdates = [start_date + BDay(i) for i in range(n)]
    price = [10.0 + i / 10.0 for i in range(n)]
    price[40:60] = [price[40] for i in range(20)]
    return pd.DataFrame(data={'date': bdates,
                              'price2': price}).set_index('date')


def test_price3():
    # Rising price series with a mild drawdown in the middle.
    start_date = pd.Timestamp(2020, 1, 1) + BDay()
    n = 100
    bdates = [start_date + BDay(i) for i in range(n)]
    price = [10.0 + i / 10.0 for i in range(n)]
    price[40:60] = [price[40] - i / 10.0 for i in range(20)]
    return pd.DataFrame(data={'date': bdates,
                              'price3': price}).set_index('date')


def test_price4():
    # Rising price series with a steeper drawdown in the middle.
    start_date = pd.Timestamp(2020, 1, 1) + BDay()
    n = 100
    bdates = [start_date + BDay(i) for i in range(n)]
    price = [10.0 + i / 10.0 for i in range(n)]
    price[40:60] = [price[40] - i / 8.0 for i in range(20)]
    return pd.DataFrame(data={'date': bdates,
                              'price4': price}).set_index('date')


price1 = test_price1()
return1 = daily_returns(price1)
price2 = test_price2()
return2 = daily_returns(price2)
price3 = test_price3()
return3 = daily_returns(price3)
price4 = test_price4()
return4 = daily_returns(price4)

print('price1')
print(f'sharpe: {sharpe(return1)}')
print('price2')
print(f'sharpe: {sharpe(return2)}')
print('price3')
print(f'sharpe: {sharpe(return3)}')
print('price4')
print(f'sharpe: {sharpe(return4)}')

The New York City government provides a large variety of datasets related to the city, making New York one of the best cities to do data analysis on (I love NYC!). For example: NYC OpenData. From driver applications to rodent inspections, you can pretty much find any data you can think of…

The coronavirus data we are going to look at today comes from the NYC Health Department. It is a daily-updated GitHub repository that contains NYC coronavirus case data. All data shown below are as of 2020-09-29.

Graphs are everywhere. From databases to parallel processing, graphs underlie architectures of all kinds in real life. The concept behind graphs is so ubiquitous that people develop an intuitive understanding of it; sometimes they don’t even realize they are dealing with graphs! So now, we are going to play with a couple of graph algorithms that involve a discussion of depth-first vs. breadth-first traversal.

The problem is simple. Suppose I have a combination lock of 4 digits, such as 0000, 0001, etc. I start out at a particular combination of digits, say 1010, and I want to reach another combination of digits, say 9876. At each step, I can increase or decrease one digit by one, like a real lock you use at the gym. …
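One way to sketch this as a graph search, assuming each wheel wraps around from 9 back to 0 (as on a real gym lock), is a plain breadth-first traversal over the 10,000 possible combinations. Each combination is a node, and turning one wheel up or down by one is an edge:

```python
from collections import deque

# BFS over 4-digit combinations: returns the minimum number of single
# wheel turns to get from start to target (wheels wrap 9 -> 0).
def min_turns(start, target):
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        combo, steps = queue.popleft()
        for i in range(4):
            d = int(combo[i])
            for nd in ((d + 1) % 10, (d - 1) % 10):
                nxt = combo[:i] + str(nd) + combo[i + 1:]
                if nxt == target:
                    return steps + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + 1))
    return -1  # unreachable (cannot happen without forbidden states)

print(min_turns('1010', '9876'))  # → 12
```

Because BFS explores combinations in order of increasing number of turns, the first time it reaches the target is guaranteed to be via a shortest sequence of moves.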

The fundamental idea behind beta and linear correlation, of course, goes back to the least squares approximation that we all know and love.

Briefly reviewing the idea behind linear regression:

Suppose I have a dependent variable y, for example the number of views I get for a story, and an independent variable x, the amount of time I spent on the story. I make a guess that the relationship between the two variables is linear: y = βx + α.
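As a toy illustration (with made-up numbers, not data from my stories), the least squares slope has a closed form: beta is the covariance of x and y divided by the variance of x, and alpha follows from the means.

```python
import numpy as np

# Least squares fit of y = beta * x + alpha in closed form.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # e.g. hours spent on a story
y = 2.0 * x + 3.0                        # e.g. views; exactly linear here

# beta = cov(x, y) / var(x); alpha = mean(y) - beta * mean(x)
beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
print(beta, alpha)  # → 2.0 3.0
```

With noise-free data the fit recovers the slope and intercept exactly; with real, noisy data it returns the line minimizing the sum of squared residuals, and beta is exactly the quantity that linear correlation rescales.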
