The new software: Advantages and disadvantages

Recently I published a post about the new software paradigm. In the new paradigm, the coder does not directly program the output for each possible input. Instead, training data is used to let the computer learn the output for each input. The computer programs itself, while the coder's and software developer's job is to prepare the data and set the target. The new paradigm seems to bring new challenges that need balancing. Like everything, it has some advantages and disadvantages.

On the advantage side, deep learning requires simpler hardware. Even though deep learning is more complex than traditional software, neural networks consist of only two basic operations: matrix multiplication and thresholding. Traditional software uses many more basic operations, like conditionals, loops, etc. Because of that, traditional CPUs need to support many more distinct operations, increasing the overall complexity of the hardware. Where traditional software uses CPUs, deep learning uses GPUs. GPUs are designed to perform matrix multiplications, making them ideal for neural networks.
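To make this concrete, here is a minimal sketch (in NumPy, with illustrative shapes and random values, not any particular network) of what a single feed-forward layer boils down to: one matrix multiplication followed by a thresholding non-linearity.

```python
import numpy as np

x = np.random.rand(1, 128)      # input vector
W = np.random.rand(128, 64)     # learned weights
b = np.random.rand(64)          # learned biases

# Matrix multiplication + thresholding (ReLU): the two basic operations.
h = np.maximum(x @ W + b, 0.0)
```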

Since deep learning is inherently less demanding on hardware, one could easily embed already trained models into chips to perform certain tasks. This would make chips low power and specifically designed for defined tasks. If the task at hand is well defined and specific, the hardware can be made more efficient, which also makes it amenable to economies of scale. Imagine manufacturing a chip that performs speech analysis. In a way, this is more portable than code. Code often needs to be recompiled on the target system to ensure that it works and performs optimally. A chip would have everything integrated, from the microphone to the model.

Once the neural network is trained, execution time and memory use are constant. A deep learning prediction only requires one forward pass, which always performs the same number of operations and uses the same amount of memory. In traditional software, different branches of the code require different amounts of resources, which leads to a wide range of execution times and resource consumption (i.e. memory allocation). At the same time, with traditional software, the branching adds complexity: the developer must not leave any case open or untreated, or introduce an infinite loop. An infinite loop in traditional software renders the program unresponsive for eternity. This specific bug never happens in neural networks.

Contrary to traditional software, which remains untouched for most of its life, deep learning models have the ability to evolve, mutate, and mature to reach and maintain the global optimum. In traditional software, calls, APIs, and modules are created alone and die alone; unless changed, software stays the same forever. In this new paradigm, your call helps the model find a better solution and be faster, and therefore adapt and improve. The more it is used, the better the whole system becomes. You browsing the internet would help build a better internet. This software would learn from usage.

Even though there are instances where certain coders would be able to write better code than deep learning, in general, neural networks are better pieces of code. The same holds for high- and low-level languages. Even though most coders do a good job programming in high-level languages and letting the compiler do its magic, there is an extremely small fraction of people who are able to tune assembly instructions to improve performance. Small local improvements do not translate into global gains.

But not everything that glitters is gold. Neural networks may be able to achieve 99% accuracy, but it may be impossible to understand how this is reached. In some cases, a 90% accuracy that humans comprehend may be preferable. This is a tradeoff that should be assessed case by case. Would you rather get a treatment that is 99% certain without knowing what it is, or one that is 95% certain but that a doctor can explain to you?

Maybe humans don’t trust machine learning models because they may be picking up biases. If that is true, it is in everyone’s best interest to prevent it. It’s in humanity’s best interest that technology remains neutral. But this may be harder than we thought. Microsoft trained a Twitter bot on Twitter data and it turned out to be racist; the lesson is that some data cleaning is required, since not everything has value. And Amazon built a misogynistic HR algorithm because they started with biased data. So it turns out that machines reflect what’s there, without the power to argue against it.

In a way we want the machines to judge their own outputs. For everyone’s comfort, maybe we should be supervising all algorithms. Deep learning will provide an answer to the data. It may not always be the right answer, but sometimes it may be so wrong that a person will easily pick it up. Models do not tell you when they don’t know; if a model was not properly trained on specific regions of the data, it will provide an embarrassingly wrong result.

Malicious people may use this lack of judgment to benefit from AI. This new software will require new security measures. Attacks against neural networks are different in nature from the attacks that can be carried out on traditional software. Attacking can be as easy as finding an input that misleads the algorithm. If models are somehow public, it’s a matter of time until somebody realizes that small tweaks produce vastly different results.

This new paradigm will require new tools. The old software needed debuggers, profilers, and IDEs. The new paradigm is built by accumulating, cleaning, and processing datasets. The new debugging will be done by adding new data with the labels the model fails to recognize. The IDEs may show the model architecture, the training data, and the labels, but they may also show which data points would be relevant to include so that the model gains certainty in its predictions. Maybe the GitHub of the new era will be about datasets and trained models.

The new software: Less coding, more data

Software, like everything, is evolving, but it is evolving differently than I thought. When I was studying computer science at university I thought that the future was parallelism. We were taught only one class on parallel programming. Multi-core computers were on the rise and it seemed to be the thing to learn. Since then my opinion has changed. There is indeed a need for parallel programmers, but it is not as big as I had foreseen. Most of the libraries and code that need parallel programming have already been implemented. Some basic notions are required every now and then.

They also teach “sequential” (regular) programming at university. Regular programming, until now, has been about giving specific instructions on what actions the program should take. Each line of code contains a specific instruction with a defined goal. The programmer writes code indicating at each step which actions the computer has to perform, without leaving room for the computer to improvise. All cases are defined in one way or another; if not, the program breaks. This is also true for parallel programming, where the code defines the actions to be performed and in which order. The difference between single-core and multi-core is that a single core executes all the instructions sequentially, whereas multiple cores can execute different parts of the code that do not necessarily follow the same paths.

In the new paradigm, the one that Andrej Karpathy made me notice, and which I agree with, software is abstract. Software becomes the weights of neural networks, and as humans we cannot interpret it or program it directly. Therefore, the goal in these instances is to define the desired behavior: the software developer defines that for these sets of inputs we want this other set of outputs. For the program to follow the right behavior, we need to write the neural network architecture that extracts the information from the domain and then train it so that the program searches the space for the best solution. We will no longer address complex problems using explicit code; instead, the machine will figure it out by itself.

Software is certainly changing in a new direction. There are many instances where it may be easier to gather more training data than to actually hard-code an algorithm to perform a specific task. In the new software era, the coders’ tasks are to curate the datasets and monitor the system. The system is optimized to perform the task in the most accurate manner. “Old school” programmers will still be needed, in the same way there are still people who code for a living in low-level languages. Old school programmers will develop labeling interfaces, create and maintain infrastructure, perform analysis, etc.

Nowadays, neural networks are the clear winner over hard-coded instructions in many different domains. With the current software, McKinsey states that 30% of companies’ activities can be automated. Machine translation, image recognition, text analysis, and games like chess or even ‘League of Legends’, which requires an advanced understanding of its universe, can be automated by computers. Google reduced the Translate program from 500,000 lines of code to only 500 thanks to TensorFlow and neural networks. These are the classic examples where deep learning is straightforward and can shine, but there are other less intuitive (and less sexy) domains where huge improvements can occur, like data structures and databases. In that not-so-sexy example, the deep learning software was up to 70% faster and used an order of magnitude less memory than the traditional software. As a last example, I would like to bring up the article where researchers took this idea to the extreme: they created software that does not even require the user to define the model and instead finds it for them.

In conclusion, traditional software, the one that we have been using until now, will remain. There are some instances, like law and medicine, where black boxes cannot be accepted and won’t be tolerated. Other times it will be more cost-efficient and easier to hard-code the features instead of preparing training data and letting the model figure it out. The new software era will be the one where coders do not explicitly write the course of action for each case; instead, the neural network will find the best solution for a given input, without the explicit instructions of a human. The new paradigm software will become more prevalent, and along these lines new software tools will be developed.

Want to reduce plastic usage? Try the wellness approach

We’ve been facing the problem of plastic usage wrong.
Plastics cannot be recycled [1], and when they can be recycled it is not an easy, straightforward process. Yet everyone uses them out of convenience. They are cheap, and nobody cares about the environment when it comes into conflict with their pocket.
The solution? Studies should be conducted to find out whether plastics are harmful to human health.
If people worry that they or their kids will get cancer, they may rethink the usage of this material. If something shortens your life expectancy and promises a long and awful death, people may reconsider buying products that are wrapped in multiple plastic layers.
There will probably be a part of the population that won’t mind and will immolate themselves anyway. But if consumers have the perception that all that plastic wrapping around food is bad for them, it may change more things than telling them to try to reduce plastic consumption.

[1] Hopewell, J., Dvorak, R., & Kosior, E. (2009). Plastics recycling: challenges and opportunities. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1526), 2115-2126.

Networking

Go to the bottom of your SMS list and regularly text people something like:
Hi Joe,
It’s Roc here, we haven’t spoken in ages!! I hope you’re fine. What’s the latest with you? No rush on the reply

Don’t skip people
The same for emails. But search for random letters and see what the autocomplete in the “to” field shows.
Look at LinkedIn (and other social media) and praise people through a private channel for something they did recently. No likes or comments on the post.
Don’t get discouraged if people take a while to reply or don’t reply at all.
For conferences, prepare dossiers on the people you want to meet so you know their professional side but also the personal one. If you talk hobbies it feels less boring, and people may be more engaged and happy to interact.
Always keep a good posture. A trick is to straighten your back every time you go through a doorway.
Ask for help. Ask for introductions for your projects. People can’t read your mind but don’t be pushy.
The tornado technique is an elevator pitch where you say what you can do for others/them. Instead of using the proper definition, say what the goal is. “I do protein synthesis” makes no sense to most people; say you create proteins as medicine.
The reverse tornado technique asks what they do for others.
Ask for facts, then emotions, and then the why, as a way to build conversations with people.
If something is obvious maybe ask for some specifics.
Last but not least, reduce noise. Look at your calendar and the people you met, and think about whether they are high quality or not. If not, limit your interaction.

Convert a mask into a Polygon for images using shapely and rasterio

Sometimes it is necessary to transform masks into polygons to use polygon operations. The way to transform a raster or a binary mask into a polygon is pretty easy. Most of the solutions I found online went the other way (from polygon to mask), but for my project I needed to convert the mask to a polygon.
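Here is a minimal sketch of how this could look, assuming a small made-up binary mask (the mask array and its values are illustrative, not from any particular dataset):

```python
import numpy as np
import rasterio.features
from shapely.geometry import shape
from shapely.ops import unary_union

# Hypothetical binary mask: 1 inside the object, 0 outside.
mask = np.zeros((128, 128), dtype=np.uint8)
mask[32:96, 40:90] = 1

# rasterio.features.shapes yields (geometry, value) pairs for each connected
# region of equal value; we keep only the regions where the mask is 1.
polygons = [shape(geom) for geom, value in rasterio.features.shapes(mask) if value == 1]

# Merge all the pieces into a single (multi)polygon for further operations.
polygon = unary_union(polygons)
print(polygon.area, polygon.bounds)
```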

Time at Risk

Time at Risk == Value at Risk
(if we replace value by time)

E.g. An insurance company’s 90% TaR is 3 years for liquidity risk
– That means that, under its current financial structure, the insurer would be 90% safe for 3 years

Time at Risk (TaR for short) is the maximum period of time during which an adverse event would not occur. We calculate it as follows:

Incidence rate formula:

$$\text{Incidence rate} = \frac{\text{Number of disease cases in a given time period}}{\text{Total person-time at risk during study period}}$$

The units are cases/person-year
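As an illustrative example (the numbers below are made up, not from any real portfolio), the computation is a simple division:

```python
# Hypothetical example: 12 adverse events observed while following
# 400 policies for an average of 2.5 years each.
cases = 12
person_years_at_risk = 400 * 2.5   # total person-time at risk

incidence_rate = cases / person_years_at_risk
print(f"{incidence_rate:.3f} cases/person-year")   # 0.012
```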

Calibration for deep learning models

Wikipedia’s definition of calibration is “the comparison of measurement values delivered by a device under test with those of a calibration standard of known accuracy”. Put in our context, that means that the distribution of predicted probabilities is similar to the distribution of observed probabilities in the training data. Rephrased once more: if your model is predicting cat vs. dog and the model states that a given image is a cat with 70% probability, then that image should be a cat 70% of the time. If the model deviates too much from that, the model is not correctly calibrated.

Then you may be thinking: why is calibration important if the model has high accuracy? Well then, let me tell you. For critical decisions, such as a car deciding whether to brake or not, the model should be confident enough to give the braking signal. If there is reasonable doubt, the system should rely on other sensors and models to make the critical decision. The same applies to human doctors and support systems: decision models in medicine should pass the decision making to human doctors when there is not enough confidence that the model is choosing the right treatment. And last but not least, calibration also helps with model interpretability and trustworthiness.

Originally, models were well calibrated [1], but it seems that the calibration of newer models is less reliable. In this post I do not aim to point out architecture flaws; they fall out of scope. Instead, I want to point out ways of visualizing, assessing, and improving the calibration of deep learning models for your architectures. These approaches are framework and model agnostic, meaning that they work in TensorFlow, Keras, PyTorch, or whatever framework you would like to use, and regardless of the model architecture implemented.

The next pair of images shows what we want to achieve in this post: we want to go from an uncalibrated model to a calibrated one.

[Figure 1: reliability diagrams of a poorly calibrated deep learning model (left) and a well calibrated one (right)]
Figure 1: ECE stands for Expected Calibration Error (the lower the better). The blue bars are counts of occurrences in a certain bin, and the red area is the difference between expected occurrences and actual ones. So the first image shows a poorly calibrated model with a high ECE (consequently lots of red), whereas in the second plot there is very little red and the ECE is very low.

To visually inspect the calibration of the model we can plot a reliability diagram (Figure 1). A reliability diagram plots the expected sample accuracy as a function of the confidence; any deviation from the diagonal represents a miscalibration. Since we don’t have infinite samples to compute the calibration as a continuous line, we first need to bin the predictions into M bins. Then, for each prediction in a bin, we assess whether the predicted label corresponds to the true label or not, and divide the result by the number of predictions within the bin. For those of you who like formulas:

$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}\left(\hat{y}_i = y_i\right)$$

Here B_m is the set of samples whose predicted confidence falls into the interval of the desired bin (in this case, bin m), and ŷ and y are the predicted label and the true label. Notice that the reliability plot does not display the number of samples. Therefore, we cannot estimate how many samples fall in each bin or the overall model calibration. For that, we need a measure that summarizes the calibration of the model into a single scalar value. There are several formulas to compute the overall model calibration. In this post, we’ll see the Expected Calibration Error (ECE) [2], which is the one displayed in the images above.
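Here is a minimal NumPy sketch of the binning step, assuming you already have the predicted confidences, the predicted labels, and the true labels as arrays (the function name and arguments are mine, not from any library):

```python
import numpy as np

def bin_accuracy_confidence(confidences, predictions, labels, n_bins=10):
    """Per-bin accuracy and mean confidence for a reliability diagram."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    accs, confs, counts = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        counts.append(in_bin.sum())
        if in_bin.any():
            accs.append((predictions[in_bin] == labels[in_bin]).mean())
            confs.append(confidences[in_bin].mean())
        else:
            accs.append(0.0)
            confs.append(0.0)
    return np.array(accs), np.array(confs), np.array(counts)
```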

ECE follows the intuition of the bar plot. It takes the weighted average of the difference between accuracy and confidence for each of the bins. The confidence of a bin is the average predicted confidence of the samples that fall in it (often approximated by the bin midpoint, i.e. averaging the left and right edges of the bin), and the accuracy is the formula presented in the previous paragraph. We compute the difference between accuracy and confidence, multiply it by the number of samples in the bin, and divide it by the total number of samples. So, the formula:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$
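A sketch of this computation, using the same assumed arrays as in the binning snippet above:

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """Weighted average of |accuracy - confidence| over the bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = (predictions[in_bin] == labels[in_bin]).mean()
            conf = confidences[in_bin].mean()
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```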

Now, after we do this, we can visualize and numerically assess how good the calibration of our model is. We can also use the sklearn library to plot the calibration curve, which is the equivalent of the binning approach we did above, where a perfectly calibrated model should produce a line that fits x = y. For example:
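A minimal sketch for a binary classifier, using synthetic predictions as a stand-in for a real model’s output:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Synthetic stand-ins for a model's output: true labels and predicted
# probabilities of the positive class (deliberately miscalibrated).
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, size=1000)
y_true = (rng.uniform(0, 1, size=1000) < y_prob ** 2).astype(int)

# calibration_curve bins the predictions and returns, per bin, the observed
# fraction of positives and the mean predicted probability.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot(prob_pred, prob_true, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```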

If you made it this far you may be thinking: now I know whether the model is well calibrated or not, but I have no clue how to fix it. So today is your lucky day! Platt scaling [3] is a parametric approach to calibration. The non-probabilistic predictions of a classifier (e.g. the scores or logits of your model) are used as features for a logistic regression model trained on the validation set to return calibrated probabilities. At this stage we don’t train the original model anymore; its parameters are fixed.
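A sketch of that idea with scikit-learn, using synthetic scores and labels as stand-ins for your frozen model’s validation and test outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic uncalibrated scores (e.g. logits) from a frozen classifier,
# plus the corresponding validation labels.
rng = np.random.default_rng(0)
val_scores = rng.normal(size=(500, 1))
val_labels = (val_scores[:, 0] + rng.normal(scale=1.5, size=500) > 0).astype(int)
test_scores = rng.normal(size=(100, 1))

# Platt scaling: fit a logistic regression (a sigmoid with two parameters)
# on the validation scores; the original model stays untouched.
platt = LogisticRegression()
platt.fit(val_scores, val_labels)

calibrated_probs = platt.predict_proba(test_scores)[:, 1]
print(calibrated_probs[:5])
```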

All of these concepts and gists should provide you with enough knowledge to assess your model’s calibration and correct it if necessary. And doing so should not affect other metrics, such as accuracy or ROC AUC, much.

References:

Main paper: Guo, Chuan, et al. “On calibration of modern neural networks.” Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, 2017.

[1] Niculescu-Mizil, Alexandru and Caruana, Rich. Predicting good probabilities with supervised learning. In ICML, pp. 625–632, 2005.

[2] Naeini, Mahdi Pakdaman, Cooper, Gregory F., and Hauskrecht, Milos. Obtaining well calibrated probabilities using Bayesian binning. In AAAI, pp. 2901, 2015.

[3] Platt, John et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3): 61–74, 1999.

Worth mentioning II: PhD survival guide, ergodicity, death after 75, content marketing, and email marketing

This is the continuation of the worth mentioning experiment, where I share links that I deem important to keep for future reference.

A graduate school survival guide: “So long, and thanks for the Ph.D!”: In this essay, Ronald Azuma lays out the path to a successful PhD defense. He states a few things that may seem obvious but that we often forget. He is not arguing in favor of a PhD title; he simply states that what works for some may not work for others.

The Logic of Risk Taking: I wanted to include this link to remember the term ergodicity as explained by Nassim Taleb. An ergodic system has the same behavior averaged over time as averaged over space. A non-ergodic system is one where observed past probabilities do not apply to future processes (Russian roulette).

A doctor and medical ethicist argues life after 75 is not worth living: This is an interview with Ezekiel Emanuel, where he argues that trying to keep people alive past 75 is a waste of resources that brings little to society. Instead, all of those resources should be invested in the younger generations so they can live a better life. A controversial argument, but worth reading.

Guide for content marketing in 2020. A very well researched and interesting guide for content marketing in 2020. It contains information from email marketing and video marketing to engagement and trends. It doesn’t disappoint at all.

Marketing lifecycle applied to newsletters. In this excellent post, the author maps the different phases of marketing to email marketing. It starts with awareness, then engagement, purchase, retention, growth, and advocacy.

Europe can do better and you know it

Europe has for many centuries been the center of the world: the cradle of the current civilization and the spark that ignited the fire of industrialization. We got into a position where we got comfortable. Over many centuries, European countries colonized the world, leaving a huge mark still noticeable nowadays. But we got complacent. Luckily we haven’t come to a grinding halt; it takes time to slow down. We were the first countries to industrialize and kept progressing at a high pace, but we are now lagging behind. There is still some advantage and inertia that keeps us in a privileged position, but this will end unless we change. Change for the better.

I don’t buy that Europe is inherently at a disadvantage due to its internal variety of cultures (link in Spanish). And I certainly don’t buy that Europe should look more like America (link in Spanish). I believe that being like China is also out of the picture in most people’s minds. So then what? Shall we die from a thousand cuts? I don’t think so. I believe Europe has enough steam (pun intended) to compete. We have enough brainpower to find the European way.

China is killing it. They have massive tech companies that reach massive numbers of consumers. Their government legislates so that it is easier for their local companies to obtain certain advantages. For AI companies it is relatively easy to obtain incredibly large datasets. Also keep in mind that some of the most relevant hardware companies are Chinese (e.g. Huawei, which makes not only cellphones but also infrastructure hardware). Meanwhile, the USA is brainwashing the European population, not only through TV and cinema but through technology. Most of the software also comes from the new world (especially social networks and operating systems); they own the platforms. At the same time, they hire smart and highly motivated people, thus extracting value twice.

It’s time to start writing our own playbook and kill it. The reinforcement loops and inertia are there. We can become a tech powerhouse and stop depending on China and the USA. Silicon Valley was developed over many years thanks to the Department of Defense and Stanford. China did it by themselves based on entrepreneurship and government advantages. Europe so far seems to be focusing on the bureaucratic way. Spoiler: it is not working. GDPR and other brain farts hurt the internal system more than the external forces. Hostile companies are big and can endure these changes with ease, but small companies and startups are doomed; the impact is much larger for them. And then, when it comes to being serious, the justice system chickens out. It should have been an unmerger. And the same goes for Google.

It’s good that there are regulations that prevent some external hostilities. But at the same time, Europe should seriously invest in European forces that can really occupy the market space. This is a much-needed strategic point that is apparently ignored. We cannot rely on external companies for such strategic markets, especially after seeing the Russian interference (because the USA didn’t use similar approaches #irony). At the same time, we should aim to extend our reach and not limit ourselves to the internal market. In this way, we make sure that our tech powerhouse is competitive. Otherwise, we will fall into protectionism and conformity without adding value, as Studwell suggests in his book.

In conclusion, we should learn to look inwards and acknowledge that we have the material and the right tools. We don’t need to look outwards. We’re the powerhouse of many industries and we can keep it that way. It is time to rise and shine.