Calibration for deep learning models

Wikipedia defines calibration as “the comparison of measurement values delivered by a device under test with those of a calibration standard of known accuracy”. In our context, this means that the distribution of predicted probabilities should match the distribution of outcomes observed in the data. Put differently: if your model predicts cat vs. dog and assigns a 70% probability of “cat” to a set of images, then about 70% of those images should actually be cats. If the model deviates too much from that, it is not correctly calibrated.

You may be thinking: why is calibration important if the model already has high accuracy? Consider critical decisions, such as a car deciding whether to brake or not. The model should be confident enough to give the braking signal. If there is reasonable doubt, the system should rely on other sensors and models to take the critical decision. The same applies to human doctors and decision support systems: decision models in medicine should hand the decision over to human doctors when there is not enough confidence that the model is choosing the right treatment. Last but not least, calibration also helps with model interpretability and trustworthiness.

Older models were typically well calibrated [1], but calibration in newer models seems to be less reliable. In this post I do not aim to point out architecture flaws; they fall outside its scope. Instead, I want to show ways of visualizing, assessing, and improving the calibration of your deep learning models. These approaches are framework and model agnostic: they work in TensorFlow, Keras, PyTorch, or whatever framework you would like to use, regardless of the model architecture implemented.

The next pair of images show what we want in this post. We want to go from an uncalibrated model to a calibrated one.

[Images: from a poorly calibrated deep learning model example (left) to a well calibrated deep learning model example (right)]
Figure 1: ECE stands for Expected Calibration Error (the lower the better). The blue bars are the counts of occurrences in each bin, and the red area is the difference between the expected occurrences and the actual ones. The first image shows a poorly calibrated model with a high ECE (and consequently lots of red), whereas in the second plot there is very little red and the ECE is very low.

To visually inspect the calibration of the model we can plot a reliability diagram (Figure 1). A reliability diagram plots the expected sample accuracy as a function of the confidence. Any deviation from the diagonal represents a miscalibration. Since we don’t have infinite samples to compute the calibration as a continuous line, we first need to group the predictions into M bins. Then, for each bin, we count the predictions whose predicted label matches the true label and divide that count by the number of predictions in the bin. For those of you who like formulas:

Formula for the reliability diagram for each bin
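
Written out in the notation of Guo et al. (the main paper listed at the end of this post), the accuracy of bin m is the fraction of its samples whose predicted label matches the true label. This is a reconstruction based on that paper and on the description above, not necessarily the exact formula image that originally appeared here:

```latex
\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}\left(\hat{y}_i = y_i\right)
```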

Bm is the set of samples whose predicted confidence falls into the interval of the desired bin (in this case, bin m), and ŷ and y are the predicted label and the true label, respectively. Notice that the reliability plot does not display the number of samples in each bin. Therefore, we cannot tell how many samples fall into a bin, nor judge the overall model calibration. For that, we need a measure that summarizes the calibration of the model into a single scalar value. There are several formulas to compute the overall model calibration. In this post, we’ll see the Expected Calibration Error (ECE) [2], which is the one displayed in the images above.

ECE builds on the intuition of the bar plot. It takes the weighted average of the absolute difference between accuracy and confidence over the bins. The confidence of a bin is the value its accuracy should match; here it is obtained by averaging the left and right edges of the bin (i.e. adding them and dividing by 2), whereas Guo et al. use the mean predicted confidence of the samples in the bin. The accuracy is given by the formula presented in the previous paragraph. For each bin, we compute the absolute difference between accuracy and confidence, multiply it by the number of samples in the bin, and divide it by the total number of samples; summing over the bins gives the ECE. So, as a formula:

Expected calibration error formula
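
In symbols, following Guo et al., ECE = \sum_{m=1}^{M} (|B_m| / n) · |acc(B_m) − conf(B_m)|, where n is the total number of samples. The following is a minimal NumPy sketch of that computation, not the code behind the figures. It assumes you already have, for every sample, the confidence of the predicted class and a flag saying whether the prediction was correct; the function name and the variables are made up for illustration. It uses the mean confidence of the samples in each bin (as in the paper) rather than the bin midpoint; with narrow bins the two are close.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Bins are half-open (lo, hi]; a confidence of exactly 0 is ignored,
        # which is fine for the confidence of the predicted class.
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue  # empty bins contribute nothing
        acc = correct[in_bin].mean()       # accuracy of the bin
        conf = confidences[in_bin].mean()  # mean confidence of the bin
        ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece


# Tiny hand-made example: four predictions at 95% confidence (three correct)
# and two predictions at 55% confidence (one correct).
example_conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
example_correct = [1, 1, 1, 0, 1, 0]
example_ece = expected_calibration_error(example_conf, example_correct)
```

In the toy example the first bin has accuracy 0.75 against confidence 0.95 and the second has accuracy 0.5 against confidence 0.55, so the weighted gap is (4/6)·0.2 + (2/6)·0.05 = 0.15.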

Once we have this, we can visualize and numerically assess how good the calibration of our model is. We can also use the sklearn library to compute the calibration curve, which is the equivalent of the binning approach above: a perfectly calibrated model should produce a line that fits x=y.
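
As a stand-in for the original snippet, here is a small sketch of the sklearn route. It assumes a binary problem (sklearn's calibration_curve works on the probability of the positive class) and uses synthetic, hypothetical data that is calibrated by construction, so the curve should sit close to the diagonal. The plotting helper is an illustrative function of my own, and it imports matplotlib only when called.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic data (an assumption for this sketch): each label is drawn with
# exactly the predicted probability, so the "model" is well calibrated.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0.0, 1.0, size=5000)
y_true = (rng.uniform(0.0, 1.0, size=5000) < y_prob).astype(int)

# Fraction of positives and mean predicted probability for each bin.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)


def plot_calibration_curve(frac_pos, mean_pred):
    """Plot the calibration curve against the x=y diagonal."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated (x=y)")
    ax.plot(mean_pred, frac_pos, marker="o", label="model")
    ax.set_xlabel("Mean predicted probability")
    ax.set_ylabel("Fraction of positives")
    ax.legend()
    return fig
```

Calling plot_calibration_curve(frac_pos, mean_pred) with your own validation labels and predicted probabilities gives the same kind of picture as the reliability diagram, drawn as a line instead of bars.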

If you made it this far you may be thinking: now I know whether the model is well calibrated or not, but I have no clue how to fix it. So today is your lucky day! Platt scaling [3] is a parametric approach to calibration. The uncalibrated outputs of the classifier (its raw scores or logits, which cannot yet be trusted as probabilities) are used as features for a logistic regression model, which is trained on the validation set to return calibrated probabilities. At this stage we don’t train the original model anymore; its parameters are fixed.
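
A minimal sketch of Platt scaling for the binary case, using sklearn's LogisticRegression as the calibrator. The helper names, the synthetic validation data, and the choice of a very large C (to make sklearn's default L2 penalty negligible) are assumptions made for this illustration; in practice the scores would come from your frozen model evaluated on the validation set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_platt_scaler(val_scores, val_labels):
    """Fit a one-feature logistic regression mapping raw scores to probabilities."""
    # Large C: assumption that we want an (almost) unregularized fit.
    scaler = LogisticRegression(C=1e6)
    scaler.fit(np.asarray(val_scores, dtype=float).reshape(-1, 1), val_labels)
    return scaler


def apply_platt_scaler(scaler, scores):
    """Return calibrated probabilities of the positive class."""
    scores = np.asarray(scores, dtype=float).reshape(-1, 1)
    return scaler.predict_proba(scores)[:, 1]


# Synthetic, hypothetical validation set: the "model" outputs logits that are
# three times too large, i.e. it is overconfident.
rng = np.random.default_rng(0)
true_logits = rng.normal(0.0, 1.0, size=5000)
true_prob = 1.0 / (1.0 + np.exp(-true_logits))
val_labels = (rng.uniform(size=5000) < true_prob).astype(int)
val_scores = 3.0 * true_logits

scaler = fit_platt_scaler(val_scores, val_labels)
calibrated = apply_platt_scaler(scaler, val_scores)
learned_slope = scaler.coef_[0][0]  # should be close to 1/3 here
```

Because the fitted mapping is monotonic whenever the learned slope is positive, the ranking of the predictions is preserved, which is why ranking metrics such as ROC AUC are not expected to change.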

All of these concepts and gists should provide you with enough knowledge to assess the calibration of your model and correct it if necessary. Calibrating should not noticeably affect other metrics such as accuracy or ROC AUC.


Main paper: Guo, Chuan, et al. “On calibration of modern neural networks.” Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017.

[1] Niculescu-Mizil, Alexandru and Caruana, Rich. Predicting good probabilities with supervised learning. In ICML, pp. 625–632, 2005.

[2] Naeini, Mahdi Pakdaman, Cooper, Gregory F, and Hauskrecht, Milos. Obtaining well calibrated probabilities using bayesian binning. In AAAI, pp. 2901, 2015.

[3] Platt, John et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3): 61–74, 1999.

Worth mentioning II: PhD survival guide, ergodicity, death after 75, content marketing, and email marketing

This is the continuation of the worth mentioning experiment, where I share links that I deem important enough to keep for future reference.

A graduate school survival guide: “So long, and thanks for the Ph.D!”: In this essay, Ronald Azuma describes the path to a successful PhD defense. He states a few things that may seem obvious but that we often forget. He is not arguing in favor of a PhD title; he simply states that what works for some may not work for others.

The Logic of Risk Taking: I wanted to include this link to remember the term ergodicity as explained by Nassim Taleb. An ergodic system behaves the same when averaged over time as when averaged over the space of its states. In a non-ergodic system, observed past probabilities do not apply to future processes (think of Russian roulette).

A doctor and medical ethicist argues life after 75 is not worth living: This is an interview with Ezekiel Emanuel in which he argues that trying to keep people alive past 75 is a waste of resources that brings little to society. Instead, all of those resources should be invested in helping the younger generations live a better life. A controversial argument, but worth reading.

Guide for content marketing in 2020. A very well researched and interesting guide for content marketing in 2020. It covers everything from email marketing and video marketing to engagement and trends. It doesn’t disappoint at all.

Marketing lifecycle applied to newsletters. In this excellent post, the author maps the different phases of marketing to email marketing. It starts with awareness, then engagement, purchase, retention, growth, and advocacy.

Europe can do better and you know it

For many centuries Europe has been the center of the world: the cradle of the current civilization and the spark that ignited the fire of industrialization. We got into a position where we got comfortable. Across many centuries European countries colonized the world, leaving a huge mark that is still noticeable nowadays. But we got complacent. Luckily we didn’t come to a grinding halt; it takes time to slow down. We were the first countries to industrialize and kept progressing at a high pace, but we are now lagging behind. There is still some advantage and inertia that keeps us in a privileged position, but this will end unless we change. Change for the better.

I don’t buy that Europe is inherently at a disadvantage due to its internal variety of cultures (link in Spanish). And I certainly don’t buy that Europe should look more like America (link in Spanish). I believe that being like China is also out of the picture in most people’s minds. So then what? Shall we die from a thousand cuts? I don’t think so. I believe Europe has enough steam (pun intended) to compete. We have enough brainpower to find the European way.

China is killing it. They have massive tech companies that reach massive numbers of consumers. Their government legislates so that it is easier for local companies to obtain certain advantages. For AI companies it is relatively easy to obtain incredibly large datasets. Also keep in mind that some of the most relevant hardware companies are Chinese (e.g. Huawei, which makes not only cellphones but also infrastructure hardware). Meanwhile, the USA is brainwashing the European population. Not only through TV and cinema, but through technology. Most of the software also comes from the new world (especially social networks and operating systems); they own the platforms. At the same time, they hire smart and highly motivated people. Thus extracting value twice.

It’s time to start writing our own playbook and kill it. Reinforcement loops and inertia are there. We can become a tech powerhouse and stop depending on China and the USA. Silicon Valley was developed over many years thanks to the Department of Defense and Stanford. China did it by themselves, based on entrepreneurship and federal advantages. Europe so far seems to be focusing on the bureaucratic way. Spoiler: it is not working. GDPR and other brain farts hurt the internal system more than the external forces. Hostile companies are big and can endure these changes with ease, but small companies and startups are doomed. The impacts are much larger for them. And then, when it is time to get serious, the justice system chickens out. It should have been an unmerge. And the same goes for Google.

It’s good that there are regulations that prevent some external hostilities. But at the same time, Europe should seriously invest in European forces that can really occupy the market space. This is a much needed strategic point that is apparently being ignored. We cannot rely on external companies for such strategic markets, especially after seeing the Russian interference (because the USA didn’t use similar approaches #irony). At the same time, we should aim to extend our reach and not limit ourselves to the internal market. In this way, we make sure that our tech powerhouse is competitive. Otherwise, we will fall into protectionism and conformity without adding value, as Studwell suggests in his book.

In conclusion, we should learn to look inwards and acknowledge that we have the material and the right tools. We don’t need to look outwards. We’re the powerhouse of many industries and we can keep it like that. It is time to rise and shine.

Tired of self empowerment

We are living in a hyperventilated society. Entrepreneurs, makers, all of us. There was a time when truly big people were defined by their achievements in life, by their accumulated experiences. Now, this has changed. It does not really matter anymore. Your current projection is what matters. What are you doing? Where are you going? Don’t stop! Otherwise, you’re suspicious. What is that person doing standing still? It’s disturbing. Suspicious. Very suspicious. No one will truly believe that you’re resting, thinking, reflecting, or taking a sabbatical. You have something to hide, something is wrong. Did you get fired? Are you sick? Or even worse: you don’t have self-discipline. I don’t care what you did or achieved in the past. In fact, no one cares. It’s no longer interesting, you are no longer relevant. Can’t you do anything? Are you outdated?

In the past, we first went to God for answers. Later, to the state. Now we have no one. You’re by yourself. Empower yourself. Do-it-yourself. Extreme ownership. You’re born alone and die alone. Don’t expect anything from anyone. Take care of yourself. Do you want to learn to play the guitar? There are plenty of YouTube channels and websites that teach you. The internet is the world’s knowledge at your fingertips. You can learn from the best in a few clicks. If you don’t speak Spanish it is because you’re lazy. There are plenty of courses, apps, and different kinds of tools and methodologies available. Many of them even for free. What are you waiting for? Don’t you pay your taxes in a tax haven? That’s because you haven’t shown enough interest. It is up to you to do so.

Do you have a boring life? It’s your fault and only yours. You should be living at your maximum potential. Living a breathtaking life. You should transcend normality. You’re unstoppable, so why do normal projects? It should be wonderful, amazing! If not, you’re doing something wrong. What you’re building is what defines you. The world is now living in vanity. Everyone is narcissistic. Exaggerate your story. Be overly optimistic. Make up your numbers. Wear unicorn clothes. Sell your company for millions. Was it full of false promises? It doesn’t matter. Nobody creates societal value anymore. What’s important is that you got rich. Fuck them.

Don’t you think it’s tiring to save the world day after day? It’s also good to slow down. Doing little. Tidying the mess. Throwing away the baggage. Observe. Think. Reflect. Give yourself time to find a new canvas. Grow as a person. Transition to a new space. To move you sometimes need to stop and check the map. If you don’t stop you get lost. You get tired. Overworked. Overwhelmed. There are some people who are broken but cannot stop. Don’t be like them. You won’t become outdated if you stop. Don’t be afraid.

Book summary: How Asia works by Joe Studwell

Many years ago (around 7) I read Poor Economics. It is a book about developing nations and the challenges they are facing. I even wrote a review in Spanish. This is a topic that aligns with my quest for a better world. Joe Studwell, in the book How Asia Works, analyzes different Asian countries: Japan, South Korea, Taiwan, Indonesia, Malaysia, Thailand, the Philippines, Vietnam, and China. He compares what each country did right and wrong and what they could have done better, explaining why some countries did better than others.

The book is well documented and has a good literary style. It is not the typical boring and dense academic approach to science. Just be aware that he analyzes what it takes for countries to go from poor to rich, which is not the same formula that applies to already rich countries. In the book, Joe is quite harsh on the IMF and the World Bank. He argues that these large bureaucratic institutions are keeping poor countries poor by giving advice that does not fit their economic status.

One of the first key points is the idea that “poor countries” should maximise their farming yield. The aim should be to get the maximum output per acre of land, thus pushing people into labour-intensive jobs on small farms. Studwell proposes land reform as the path to achieving a high yield per acre. Expropriating land from rural landlords and giving it back to the farmers should, in theory, prevent communism and motivate the farmers to do the best job they can. At the same time, the government should be aware that farmers cannot stand alone and need external aid. Farmers need services and infrastructure. Once a good farming base is set, the next step is manufacturing and industry.

After the farming stage is completed or near completion, the country tilts toward industrialism. There is not enough land for everyone, so people move towards the main metropolis. This movement is accompanied by an industrialization of the economy. Studwell often repeats that manufacturing companies should aim for exports to be more productive. He calls it “export discipline”. By forcing companies to export, the government makes sure that they reach the global market and keep high standards for their products. The companies that find a niche and produce quality goods will succeed at exporting and survive the test of time. Although he argues that companies should be protected in the early stages of industrialization, gross exports are much more important than net exports.

There are several strategies to improve the quality of the manufactured products. The first one is collaboration with more advanced companies from other countries, in order to learn the know-how and the state-of-the-art processes of the industry. Then the government should apply a kind of Darwinian management: companies should be allowed to fail (go bankrupt) or be forced to merge in order to accumulate volume and knowledge. It is about allowing internal competition while protecting the baby companies, but forcing them to compete in the global market. The government should not pick the winners. The key point is to achieve accelerated technological upgrading.

In conclusion, I liked this book, although towards the end it became repetitive. These may be some good ideas that African countries and less fortunate Asian ones could apply to their economies. It never ceases to amaze me how different countries evolve differently. But most surprising is how some countries turned around in a very, very short period of time and became superpowers.

10 Interesting Conversation Starters from Vanessa Van Edwards

Ten questions:
1) What is one thing that you have always wanted to try but never have? Why haven’t you done it yet?
2) What has been the highlight of your year so far?
3) What book or movie character do you most relate to?
4) What is your biggest regret?
5) What do you daydream about?
6) Who would you play in a movie?
7) If you could trade places with anyone for 1 week, who would it be?
8) What is something that you have always wanted to learn?
9) What is your best memory?
10) What question have you always wanted to ask me? What do you wish you knew about me?

And here is the full video:

We got the trolley dilemma wrong

The trolley dilemma or trolley problem is a thought experiment in ethics. The problem states that we have a trolley on railway tracks heading directly towards five people. They are tied to the tracks and therefore cannot move. The trolley is also unstoppable. You are standing next to a lever with a full picture of the situation. If you pull the lever, the trolley will switch to a different set of rails. However, the new direction has one person tied to the tracks. So you have two options:
1. Pull the lever and the trolley will kill one person
2. Don’t pull the lever and the trolley will kill five people

This is a dilemma because if you pull the lever you’re being utilitarian and you end up killing one person and preventing the death of five. Whereas by not pulling the lever you take no moral responsibility and five people die. This problem is interesting because it has many dimensions. For example, the only person that would die by pulling the lever could be someone you love, then you may be inclined to let the five die.

Why am I telling you this? Why did we get the dilemma wrong, as stated in the title?

Well, because it’s not about the people who would die, but about the trolley company whose service would be interrupted. With the current pandemic, the famous coronavirus COVID-19, politicians and other authorities have been faced with a novel problem. Something that happens once in a lifetime. Something that we hope we will never see again. Since January, different countries have been forced to take extreme measures to prevent the spread and contain the virus as much as possible so that the health system does not collapse. As I’m writing this some countries have failed, but it is still important to minimize the spread to give hospitals time to successfully treat as many people as possible.

The health care system has collapsed in some countries because politicians were reluctant to apply the right measures on time. In part, I attribute this to the fact that successful prevention would have been deemed an over-reaction. However, Spain had a 2-week forecast simply by looking at Italy. Then why did the government fail miserably? Because of the trolley company. The trolley company here represents the economy, and our politicians reacted late because they didn’t want to disrupt the economy. Some have already wondered if it would be better to let the elderly die for the benefit of the economy, while others think that the show must go on regardless of the consequences.

With an economic collapse and meltdown in sight, the USA, Europe and others have already prepared incentives to soften the fall. Some even argue that these incentives won’t prevent the economy from failing and certainly won’t help the average citizen much.

Those incentives are predominantly oriented towards saving the economy, not the individual who is suffering. Such a grinding halt of the world economies has negatively affected a lot of people. People who cannot go to work. Small business owners who are forced to shut down indefinitely. Every bar, restaurant, and disco is closed until further notice. Hotels have close to 0% occupancy. And yet the government is worried about big business. Too big to fail? Should we save big business because then we are indirectly saving lots of jobs? I say “we” because that money comes from our taxes, while the ones taking the decisions have other tax liabilities. Maybe in some cases we should save some businesses. But it’s my belief that those rescued businesses should suffer. CEOs should not be artificially pumping the stock so they get a bonus. CEOs should prepare for the future the best they can. It’s very advantageous if the wins go to the CEOs/companies and the losses to the taxpayer. But that’s not the way it should be, because it incentivizes a reckless way of doing business and penalizes businesses that were conservative and prepared for the long run. The unprepared ones should be allowed to fall. That incentivizes the survival of the fittest. However, if a company is deemed strategically important, I’m in favor of rescues, but with conditions. Rescuing should not be free money. It should come in exchange for shares or other assets that will return value to the taxpayer.

So in conclusion. Are we f*cked? I would say yes, a little, but with every collapse new opportunities arise. This may be the time for a paradigm change. Maybe we should repurpose the economy for the greater good instead of value extraction. Maybe it’s time to dust off old ideas and repurpose them for the modern world. Maybe it’s time to implement the Tobin tax and prevent short-sighted speculative trading with little value creation. Maybe we should take hazardous activities into the value calculation, so that polluting costs money. Maybe it’s time to consider happiness, health and wealth as equally important metrics for countries to optimize. Maybe we should consider widening the GDP definition and adding other non-monetary factors. Or we can simply give up, look down and follow the current system, where rich people have assets and are protected against many setbacks while most people have to work for a living and this downturn may imply severe consequences for them.

How to add ROC AUC as a metric in Tensorflow / Keras

To add ROC AUC as a metric in your Tensorflow / Keras project, copy this function that computes the ROC AUC and pass it to the model. The function only requires a little customized tf code.

To use the function in the model, we first need to compile the model with the function passed directly, not as a string (as shown in the example below).

Then we can use it in the callbacks but we need to refer to it as a string (so this time between the “” as shown in the snippet below).

Other customized functions follow the same pattern.

TensorFlow already implements the Area Under the Curve natively in the Keras module, as tf.keras.metrics.AUC. You can check the documentation for further information in this regard.
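
As an illustration of the native route, here is a minimal sketch that uses tf.keras.metrics.AUC instead of a custom function. The toy model, its input shape, and the callback settings are placeholders chosen for this example. The pattern is the same as described above: the metric object is passed directly at compile time, while callbacks refer to it by its string name (here "auc", so the validation value is logged as "val_auc").

```python
import tensorflow as tf

# Hypothetical toy binary classifier; replace with your own architecture.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# The metric is passed as an object, not as a string.
auc_metric = tf.keras.metrics.AUC(name="auc")
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[auc_metric])

# Callbacks refer to the metric by name; "val_" is prepended for validation data.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_auc", mode="max", patience=3)

# Training would then look like this (x_train, y_train, x_val, y_val are yours):
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[early_stop])
```

Note that tf.keras.metrics.AUC approximates the area with a fixed number of thresholds (200 by default), so its value can differ slightly from an exact computation such as sklearn's roc_auc_score.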

How to insert code snippets in WordPress or any other website

I was recently struggling to insert code snippets into this blog. I tried SyntaxHighlighter Evolved among others. And I didn’t like the outcome… (CSS is messed up?). The way to go?

1. Create a Gist on GitHub
2. Copy the embedded JS script they offer you after publishing the code snippet.
3. Put it on your website

Nice, isn’t it?

Forcing the disagreement

Currently, the idea that the crowd is wiser than any single individual is widespread. Places like Wikipedia and Quora rely on the network effect of the internet to benefit from collective human knowledge. The idea behind this is that the aggregated knowledge of the population is able to find the right answer and cancel out the noise. But the crowd may not be wise.

The un-wiseness may be caused by the herd mentality. Once several people group together and discuss the points and actions to take, only a few ideas may prevail. People with stronger charisma or a higher position in the hierarchy may impose ideas on their peers without realizing it, leading everyone in the meeting room to buy into them and go with the flow.

This is why it is important to put mechanisms in place to question and re-evaluate the action plans decided in meeting rooms. The tool that I’m proposing here is a forced disagreement. If everyone in the room agrees, the last person to agree on the plan has to disagree, even though he or she strongly believes in it. This person has to do the research, argue why the idea is wrong, and work on an alternative approach. At every meeting the person who disagrees should change, thus spreading the “thinking out of the box” across the people in the team.

This tool forces people to think outside the box and not follow the pack mentality. This is important because the agreed decision is not always the best. The disagreement, although artificial, creates an intellectual tension that incentivizes continuous growth within the group. It prevents conformity and may lead to overlooked but better solutions.