Few days ago, I started to work on a simple url shortener. The original idea was from one of my friends, he mentioned about the idea of implement a simple and deployable url shortener. This idea just popout my head when I was working on other side project. So I spend a few days working on it and it turns out to be an interesting project that I want to blog about:
A Url Shortener
A url shortener, as its name, is a application that provide user a shorter version of url. Like the most famous one: tinyurl.com, which map the url address into a short sequence of code, like tinyurl.com/ccxngqo. Make user easier to share the link on social media like twitter or forum.
What the application do is actually pretty simple, they store the url in the table and set a id for it. When user request the id, redirect user to the url. The problem here is: how to make the id shorter? The obvious solution is to use the character as id but not integer id. So most url shortener will use the code like ‘aSgY2’, which combine the upper, lower case letters and numbers as id. With this method, one character can present 26+26+10=62 choice. A 5-letter code can present 625=916,132,832 of urls. far better then just use 5 numbers.
There are two methods to map the long url into short code, one is to convert the increment interger id into code. Which is easier and can use up all of the space. However, the space (of 5-6 digits code) will be running out eventially because there are billions of urls on internet. So it have to increase code length to increase the storage of url. Or it have to expire the unused url and reuse these code. But when it comes to expire, The incremental id will become more complex since it need to track the smaller free id first. Also, the application have to maintain a index of url -> id to increase the search speed, and it will have to maintain index often if the url will expire.
Another solution is to hash the url address, so the url will match to a code generated by url string that same url will always map to same code. This method works well with the expire method because application doesn’t need to track which code is free. However, this method won’t be able to use up all of the available codes and the hash will collision(which is, two different url generate the same code) when numbers of url increase.
For the project, I choose the hash method to generate code. Because this method works well with redis hash map data type. Lets bring some code in the project:
The hash here use the node.js crypto module to hash the url, and reduce the code by mod the hash result into the range of 5 digits (916,132,832). And map the number into usable symbols as return code.
But, what happen when the code collide? and how often will it happen when the url increase?
(Warning, this section include some math!)
The hash collision problem, can be seen as a generalize birthday problem. Which is: what’s the probability of two people have same birthday for N people?
Lets solve the birthday problem first:
First, the number of days in a year is 365 (exclude the leap year). So the probability of at lease 2 people have same birthday equal to 1 - probability of all people have unique birthday.
Probability_of_same_birthday(N) = 1 - Probability_of_unique_birthday(N) Probability_of_unique_birthday(N) = 1 * 364/365 * 363/365 * … * (365 - N + 1)/ 365 = ( 365 * 364 * … (365 - N + 1)) / 365^N = 365!/ ((365 - N)! * 365^N) = N! * Combination(365, N) / 365^N
In the hash context, the range of code is 62 ^ 5 = 916,132,832, so the probability of having all unique hash in N url is:
Probability_of_unique_hash(N) = N! * Combination(916132832, N) / 916132832^N The approximation of Probability is: P(N) = 1 - e^(-n^2/916132832*2)
So the probability of collision in 916132832 space with n hash P(n) is:
P(10) = 5.458 * 10^-8 P(1000) = 5.456 * 10^-4 P(100000) = 0.995 P(200000) = 1 P(N) = 0.001, N ~= 1355 P(N) = 0.01, N ~= 4292 P(N) = 0.10, N ~= 13895 P(N) = 0.50, N ~= 35638
As the birthday problem, the collision chance is actually higher than we thought, the 916132832 space can only hold around 35638 hash without too much collision.
So to solve the collision problem, I choose to use the other hash method when collision happen. The node.js crypto module have md5, sha1, sha256, sha512 can generate different code. So if collision happened, it will switch to another hash method. If all four hash collide, the code will increase one digit to hold the new hash. And start with md5 to generate hash again.
Here’s the code that rehash url:
With this method, I set the expire time as one month, and the application don’t need to track the expiration of url and will automatically reuse the empty spot. There might have duplication of url, but the probabillity is small, since the url used often will always occupy the first or second hash spot.
This project introduce some pro and cons on using increase integer or hash code as id. So we know why the url shortener is implmented in a certain way. I would like to know how the bit.ly and tinyurl.com do the url shortening under the hood. Because the fall back hash method still need some prove to be usable under large amounts of urls.