A Brief Introduction to Machine Learning for Engineers (Revised)

A revised version of my notes on machine learning can be found here. I am grateful for all the comments received on the first version, and I welcome further feedback.


Turing, London, and Information Theory

In this one-page contribution to the first London Symposium on Information Theory, Alan Turing discusses learning as opposed to programming, the role of computational complexity in information theory, and genetic algorithms — in 1950.

(with thanks to Deniz Gündüz)

The Rise of Hybrid Digital-Analog

As a keen observer of nature, Leonardo da Vinci was more comfortable with geometry than with arithmetic. Shapes, being continuous quantities, were easier to fit, and disappear into, the observable world than discrete, discontinuous, numbers. For centuries since Leonardo, physics has shared his preference for analog thinking, building on calculus to describe macroscopic phenomena. The analog paradigm was upended at the beginning of the last century, when the quantum revolution revealed that the microscopic world behaves digitally, with observable quantities taking only discrete values. Quantum physics is, however, at heart a hybrid analog-digital theory, as it requires the presence of analog hidden variables to model the digital observations.

Computing technology appears to be following a similar path. The state-of-the-art computer that Claude Shannon found in Vannevar Bush’s lab at MIT in the thirties was analog: turning its wheels would set the parameters of a differential equation to be solved by the computer via integration. Shannon’s thesis and the invention of the transistor ushered in the era of digital computing and the information age, relegating analog computing to little more than a historical curiosity.

But analog computing retains important advantages over digital machines. Analog computers can be faster in carrying out specialized tasks. As an example, deep neural networks, which have led to the well-publicized breakthroughs in pattern recognition, reinforcement learning, and data generation tasks, are inherently analog (although they are currently mostly implemented on digital platforms). Furthermore, while the reliance of digital computing on either-or choices can provide a higher accuracy, it can also yield catastrophic failures. In contrast, the lower accuracy of analog systems is accompanied by a gradual performance loss in case of errors. Finally, analog computers can leverage time, not just as a neutral substrate for computation as in digital machines, but as an additional information-carrying dimension. The resulting space-time computing has the potential to reduce the energetic and spatial footprint of information processing.

The outlined complementarity of analog and digital computing has led experts to predict that hybrid digital-analog computers will be the way of the future. Even in the eighties, Terrence J. Sejnowski is reported to have said: “I suspect the computers used in the future will be hybrid designs, incorporating analog and digital.” This conjecture is supported by our current understanding of the operation of biological neurons, which communicate using the digital language of spikes, but maintain internal analog states in the form of membrane potentials.

With the emergence of academic and commercial neuromorphic processors, the rise of hybrid digital-analog computing may just be around the corner. As is often the case, the trend has been anticipated by fiction. In Autonomous, robots have a digital main logic unit with a human brain as a coprocessor to interpret people’s reactions and emotions. Analog elements can support common sense and humanity, in contrast to digital AI that “can make a perfect chess move while the room is on fire.” For instance, in H(a)ppy and Gnomon, analog is an element of disruption and reason in an ideally ordered and purified world under constant digital surveillance.

(Update: Here is a recent relevant article.)

Impossible Lines

In a formal field such as Information Theory (IT), the boundary between possible and impossible is well delineated: given a problem, the optimality of a solution can be in principle checked and determined unambiguously. As a pertinent example, IT says that there are ways to compress an “information source”, say a class of images, down to some file size, and that no conceivable solution could do any better than the theoretical limit. This is often a cause of confusion among newcomers, who tend more naturally to focus on improving existing solutions (say, on producing a better compression algorithm as in “Silicon Valley”) rather than asking whether the effort could be fruitful at all given intrinsic informational limits.

The strong formalism has been among the key reasons for the many successes of IT, but — some may argue — it has also hindered its applications to a broader set of problems. (Claude Shannon himself famously warned about an excessively liberal use of the theory.) It is not unusual for IT experts to look with suspicion at fields such as Machine Learning (ML) in which the boundaries between possible and impossible are constantly redrawn by advances in algorithm design and computing power.

In fact, a less formal field such as ML allows practice to precede theory, letting the former push the state-of-the-art boundary in the process. As a case in point, deep neural networks, which power countless algorithms and applications, are still hardly understood from a theoretical viewpoint. The same is true for the more recent algorithmic framework of Generative Adversarial Networks (GANs). GANs can generate realistic images of faces, animals and rooms from datasets of related examples, producing fake faces, animals and rooms that cannot be distinguished from their real counterparts. It is expected that soon enough GANs will even be able to generate videos of events that never happened (watch Françoise Hardy discuss the current US president in the 60’s). While the theory may be lagging behind, these methods are making significant practical contributions.

Interestingly, GANs can be interpreted in terms of information-theoretic quantities (namely the Jensen-Shannon divergence), showing that the gap between the two fields is perhaps not as unbridgeable as it has been broadly assumed to be, at least in recent years.
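
As a rough illustration of the quantity mentioned above, here is a minimal sketch that computes the Jensen-Shannon divergence between two discrete distributions. The function names and the two toy distributions are assumptions made for illustration only; they do not come from any GAN implementation. In the original GAN analysis, the generator's objective under an optimal discriminator reduces, up to constants, to this divergence between the data distribution and the generated one.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits for discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical distributions over three outcomes (illustrative values only).
real = [0.5, 0.5, 0.0]   # stands in for the data distribution
fake = [0.0, 0.5, 0.5]   # stands in for the generator's distribution

print(js_divergence(real, real))  # identical distributions
print(js_divergence(real, fake))  # partially overlapping distributions
```

The divergence is zero when the two distributions coincide and grows, up to one bit, as they become easier to tell apart, which is the sense in which a generator that fools the discriminator has matched the data distribution.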

The Network & the Network

In “The City & the City”, China Miéville imagines an unusual coexistence arrangement between two cities located in the same geographical area that provides a surprisingly apt metaphor for the concept of network slicing in 5G networks: from the city & the city to the network & the network.

The two cities: Besźel and Ul Qoma occupy the same physical location, with buildings, squares, streets and parks either allocated completely to one city or “crosshatched”, that is, shared. The separation and isolation between the two cities is not ensured by physical borders, but is rather enforced by cultural customs and legal norms. The inhabitants of each city are taught from childhood to “unsee” anything that lies in the other city, consciously ignoring people, cars and buildings, even though they share the same sidewalks, roads and city blocks. Recognition of “alter” areas and citizens is made possible by the different architectures, language and clothing styles adopted in the two cities. Breaching the logical divide between Besźel and Ul Qoma by entering areas or interacting with denizens of the other city is a serious crime dealt with by a special police force. (Prospective tourists in Besźel or Ul Qoma are required to attend a long preliminary course to learn how to “unsee”.)

And now for the two networks: Experts predict an upcoming upheaval in telecommunication networks to parallel the recent revolution in computing brought on by cloudification. Just as computing and storage have become readily available on demand to individuals, companies and governments on shared cloud platforms, network slicing technologies are expected to enable the on-demand instantiation of wireless services on a common network substrate. Networking and wireless access for, say, a start-up offering IoT or vehicular communication applications, could be quickly set up on the hardware and spectrum managed by an infrastructure provider. Each service would run its own network on the same physical infrastructure but on logically separated slices — the packets and signals of one slice “unseeing” those of the other. In keeping with the metaphor, ensuring the isolation and security of the coexisting slices is among the key challenges facing this potentially revolutionary technology.
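
To make the metaphor concrete, the following toy sketch models logical isolation on a shared substrate: endpoints are attached to slices, and a packet is delivered only when sender and receiver sit on the same slice. The class `SharedSubstrate`, its methods, and the slice and endpoint names are all hypothetical; an actual 5G slicing system involves radio resource management, virtualization and orchestration that are not modeled here.

```python
from collections import defaultdict

class SharedSubstrate:
    """Toy model: one physical network carrying logically isolated slices."""

    def __init__(self):
        # slice name -> list of (destination, payload) delivered within that slice
        self.delivered = defaultdict(list)
        # endpoint name -> name of the slice the endpoint is attached to
        self.attachment = {}

    def attach(self, endpoint, slice_name):
        """Attach an endpoint to a slice (one slice per endpoint in this toy model)."""
        self.attachment[endpoint] = slice_name

    def send(self, src, dst, payload):
        """Deliver the payload only if both endpoints are attached to the same slice."""
        src_slice = self.attachment.get(src)
        if src_slice is None or src_slice != self.attachment.get(dst):
            return False  # packets "unsee" endpoints on other slices
        self.delivered[src_slice].append((dst, payload))
        return True

net = SharedSubstrate()
net.attach("sensor", "iot")
net.attach("gateway", "iot")
net.attach("car", "vehicular")

print(net.send("sensor", "gateway", "temperature=21"))  # same slice
print(net.send("sensor", "car", "temperature=21"))      # different slices
```

As in the novel, the separation here is a matter of rules rather than physical borders: both slices use the same object, and isolation holds only as long as the check in `send` is enforced.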


Net Neutrality vs Net Vitality (and 5G)

A prime example of the complex relationship between digital technologies and the legal system is the fluidity and geographical variance of the laws that regulate broadband access. The discussion is typically framed, as far as I can tell from my outsider’s perspective, around two absolute principles, namely network neutrality and network vitality. The net neutrality and net vitality camps, at least in their purest expressions, often seem uninterested in hearing each other’s arguments. This tends to hide from public discussion the layered technological, economic, moral and legal aspects that underlie the delicate balance between access and economic incentives that is at the core of the issue. And things appear to be getting even more involved with the advent of 5G.

Net neutrality is, for purists, the principle that all bits are created equal. Accordingly, broadband access providers should not be allowed to “throttle” packets on the basis, for instance, of their application (e.g., BitTorrent) or their origin (as determined by the IP address). The network should be “dumb” and only convey bits between the two ends of a communication session. Regulation that upholds net neutrality rules is in place in many countries around the world, including in the EU and the US. Under the previous US administration, the FCC reclassified broadband Internet access as a “common carrier”, that is, as a public utility, in 2015, allowing the enforcement of net neutrality rules. Under the new administration, this decision now appears likely to be reversed.

The counterarguments to net neutrality typically center around some notion of net vitality, which refers broadly to the dynamism of the broadband Internet ecosystem, particularly as it pertains to investment and growth. The term was coined in a report by the Media Institute, where a quantitative index was proposed as a compound measure of the net vitality of a country in terms of applications and content (e.g., access, e-government, social network penetration, app development), devices (e.g., smart phone penetration and sales), networks (e.g., cybersecurity, investment, broadband prices), and macroeconomic factors (e.g., number and evaluation of start-ups).

Net neutrality purists — not all advocates fall in this category — believe that allowing broadband access providers to discriminate on the basis of a packet’s identity would pose a threat to freedom of expression and competition. Without net neutrality rules, telecom operators could in fact block competitors’ services, and also favor deep-pocketed internet companies, such as the Frightful Five (Alphabet, Amazon, Apple, Facebook and Microsoft), that can outspend start-ups for faster access. A case in point is the ban of Google Wallet by Verizon Wireless, AT&T, and T-Mobile to promote their competing Isis (!) mobile payment system.

The net vitality camp, headed by broadband access providers and economists, deems net neutrality rules to be an impediment to investment and growth. As claimed in a 2016 manifesto by European telecom operators, only by charging more for better service can sufficient revenue be raised by broadband access providers to fund new infrastructure and services.

Digging a little deeper, one finds that the issue is more complex than implied by the arguments of the two camps. To start, some discrimination among the bits carried by the network may in fact serve a useful purpose. For instance, by letting some packets be transported for free, telecom operators can offer zero-cost Internet access to the poorest communities in the developing world, as in the Facebook Zero and Google Free Zone projects. And packet prioritization is in fact already implemented in LTE networks as a necessary means to ensure call quality for Voice over LTE (VoLTE is not considered to be a broadband Internet service and hence not subject to net neutrality regulations).

That net neutrality is a more subtle requirement than the “every bit is created equal” mantra is in fact well recognized by many net neutrality advocates. When making the case for net neutrality rules, the then-president Obama called for “no blocking, no throttling, no special treatment at interconnections, and no paid prioritization to speed content transmission”, hence stopping short of prescribing full bit equivalence. Tim Berners-Lee and Eric Schmidt have also voiced similar opinions.

The planned transition to 5G systems is bound to add a further layer of complexity to the relationship between net neutrality and net vitality. 5G networks are indeed expected not only to provide broadband access, but also to serve vertical industries through the deployment of ultra-reliable and low-latency communication services. In this context, it seems apparent that bits carrying information about, say, a remote surgery or the control of a vehicle, should not be treated in the same way as bits encoding an email.

As the example of VoLTE shows, a general solution may lie in isolating mobile broadband services, on which strong net neutrality guarantees can be enforced, from other types of traffic, such as ultra-reliable and machine-type communications, on which traffic differentiation may be allowed. The feasibility of this approach is reinforced by the fact that isolation is a central feature of network slicing, a technology that will allow operators of 5G to create virtual networks that are fine-tuned for specific applications.


Information, Knowledge, Wisdom and 5G

One of the most compelling conceptual visions for 5G contrasts the user-driven information-centric operation of previous generations with the industry-driven knowledge-centric nature of the upcoming fifth generation. According to this vision, the evolution from 1G to 4G has been marked by the goal of enhancing the efficiency of human communication, with end results that we are still trying to understand and manage. In contrast, 5G will not be aimed at channeling tweets or instantaneous messages for human-to-human communications, but at transferring actionable knowledge for vertical markets catering to the healthcare, transportation, agriculture, manufacturing, education, automation, service and entertainment industries. In other words, rather than carrying only information, future networks will carry knowledge and skills. Whose knowledge and whose skills will be amplified and shared by the 5G network infrastructure?

Two options are typically invoked: learning machines (AI) and human experts. AI is widely assumed to be able to produce actionable knowledge from large data sets solely for tasks that require systematic, possibly real-time, pattern recognition and search operations. Typical examples pertain to the realm of the Internet of Things, with data acquired by sensors feeding control or diagnostic mechanisms. AI is, however, still very far from replicating the skills of human experts when it comes to “instinctive intelligence”, making multi-faceted judgements based on acquired “wisdom”, innovating, relating to other humans, providing advice, offering arguments, and, more generally, performing complex non-mechanical tasks. Therefore, human experts can complement the knowledge and skills offered by AI. A scenario that is consistently summoned is that of a surgeon operating on a patient remotely thanks to sensors, haptic devices and low-latency communication networks.

By sharing knowledge and skills of AI and human experts, 5G networks are bound to increase the efficiency and productivity of learning machines and top professionals, revolutionizing, e.g., hospitals, transport networks and agriculture. But, as a result, 5G is also likely to become a contributor to the reduction of blue- and white-collar jobs and to the widening income gap between an educated elite and the rest. This effect may be somewhat mitigated if more optimistic visions of a post-capitalist economic system, based on sharing and collaborative commons, are at least partly realized thanks to the communication substrate brought by 5G.

It from Bit

In most classes on information theory (IT), the relationship between IT and physics is reduced to a remark on the origin of the term “entropy” in Boltzmann’s classical work on thermodynamics. This is possibly accompanied by the anecdote regarding von Neumann’s quip on the advantages of using this terminology. Even leaving aside recent, disputed, attempts, such as constructor theory (see here) and integrated information theory (see here), to use concepts from IT as foundations for new theories of the physical world, it seems useful to provide at least a glimpse of the role of IT in more mainstream discussions on the future of theoretical physics.

As I am admittedly not qualified to provide an original take on this topic, I will rely here on the poetic tour of modern physics by Carlo Rovelli, in which one of the last chapters is tellingly centered on the subject of “information”. Rovelli starts his discussion by describing information as a “specter” that is haunting theoretical physics, arousing at the same time enthusiasm and confusion. He goes on to say that many scientists suspect that the concept of information may be essential to make progress in theoretical physics, providing the correct language to describe reality.

At a fundamental level, information refers to a correlation between the states of two physical systems. A physical system, e.g., one’s brain, has information about another physical system, e.g., a tea cup, if the state of the tea cup is not independent of that of the neurons in the brain. This happens if a state of the tea cup, say that of being hot, is only compatible with a subset of states of the brain, namely those in which the brain has memorized the information that the tea cup is hot. Reality can be defined by the network of such correlations among physical systems. In fact, nature has evolved so as to manage these correlations in the most efficient way, e.g., through genes, nerves, languages.
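
One standard way to quantify such a correlation is the mutual information between the two systems' states. The sketch below is only an illustration under assumed numbers: the two joint probability tables for (tea cup, brain) are made up, and the function name `mutual_information` is not taken from the text.

```python
import math

def mutual_information(joint):
    """Mutual information I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal distribution of X (rows)
    py = [sum(col) for col in zip(*joint)]    # marginal distribution of Y (columns)
    return sum(
        pxy * math.log2(pxy / (px[x] * py[y]))
        for x, row in enumerate(joint)
        for y, pxy in enumerate(row)
        if pxy > 0
    )

# Rows: tea cup is cold / hot. Columns: brain "remembers cold" / "remembers hot".
# Both tables are hypothetical.
independent = [[0.25, 0.25], [0.25, 0.25]]  # brain state unrelated to the cup
correlated = [[0.5, 0.0], [0.0, 0.5]]       # brain state always matches the cup

print(mutual_information(independent))
print(mutual_information(correlated))
```

In the first table the brain carries no information about the cup; in the second, each cup state is compatible with only one brain state, and the brain holds one full bit about the cup.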

The description of information in terms of correlation between the states of physical systems is valid in both classical and quantum physics. In thermodynamics, the missing information about the microstate of a system, e.g., about the arrangement of the atoms of a tea cup, given the observation of its macrostate, e.g., its temperature, plays a key role in predicting the future behavior of the system. This missing information is referred to as entropy. In more detail, the entropy is the logarithm of the number of microstates that are compatible with a given macrostate. The entropy tends to increase in an isolated system, as information cannot materialize out of thin air and the amount of missing information can only grow larger in the absence of external interventions.
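
The definition of entropy as the logarithm of the number of compatible microstates can be checked on a toy system. The sketch below assumes a hypothetical collection of two-state particles, with the macrostate given by how many of them are excited; it measures entropy in bits and omits Boltzmann's constant, so it illustrates the counting argument rather than a physical calculation.

```python
import math

def num_microstates(n_particles, n_excited):
    """Number of arrangements with exactly n_excited of n_particles excited."""
    return math.comb(n_particles, n_excited)

def entropy_bits(n_particles, n_excited):
    """Entropy of the macrostate: log2 of the number of compatible microstates."""
    return math.log2(num_microstates(n_particles, n_excited))

# Hypothetical system of 4 two-state particles.
for k in range(5):
    print(k, num_microstates(4, k), entropy_bits(4, k))
```

The macrostate with no excited particles is compatible with a single arrangement, so observing it leaves no missing information; the balanced macrostate is compatible with the most arrangements and hence has the largest entropy.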

In quantum physics, as summarized by Wheeler’s “It from Bit” slogan, the entire theoretical framework can be largely built around two information-centric postulates: 1) In any system, the “relevant” information that can be extracted so as to make predictions about the future is finite; 2) Additional information can always be obtained from a system, possibly making irrelevant previously extracted information (to satisfy the first postulate).

The enthusiasm and confusion aroused by the concept of information among theoretical physicists pertain to many fundamental open questions, such as: What happens to the missing information trapped in a black hole when the latter evaporates? Can time be described, as suggested by Rovelli, as “information we don’t have”? Related questions abound also in other scientific fields, such as biology and neuroscience: How is information encoded in genes? What is the neural code used by the brain to encode and process information?


Spiking Neural Networks and Neuromorphic Computing

Deep learning techniques have by now achieved unprecedented levels of accuracy in important tasks such as speech translation and image recognition, despite their known failures on properly selected adversarial examples. The operation of deep neural networks can be interpreted as the extraction, across successive layers, of approximate minimal sufficient statistics from the data, with the aim of preserving as much information as possible with respect to the desired output.

A deep neural network encodes a learned task in the synaptic weights between connected neurons. The weights define the transformation between the statistics produced by successive layers. Learning requires updating all the synaptic weights, which typically run in the millions; and inference on a new input, e.g., audio file or image, generally involves computations at all neurons. As a result, the energy required to run a deep neural network is currently incompatible with an implementation on mobile devices.
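
To give a sense of scale, the following sketch counts the synaptic weights of a fully connected network. The layer widths are hypothetical, chosen only for illustration, and the helper name `dense_network_cost` is an assumption. In such a network, inference on one input takes roughly one multiply-accumulate operation per weight.

```python
def dense_network_cost(layer_sizes):
    """Return (weights, biases) for a fully connected network with the given layer widths."""
    # Each pair of consecutive layers contributes (inputs * outputs) weights.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # Each non-input neuron has one bias term.
    biases = sum(layer_sizes[1:])
    return weights, biases

# Hypothetical layer widths: 784 inputs, two hidden layers of 1024 neurons, 10 outputs.
weights, biases = dense_network_cost([784, 1024, 1024, 10])
print(weights, biases)
```

Even this small example has close to two million weights, all of which are touched by every inference and updated during learning.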

The economic incentive to offer mobile users applications such as Siri has hence motivated the development in recent years of computation offloading schemes, whereby computation is migrated from mobile devices to remote servers accessed via a wireless interface. Accordingly, users’ data is processed on servers located within the wireless operator’s network rather than on the devices. This reduces energy consumption at the mobiles while, at the same time, entailing latency (a significant issue for applications such as Augmented Reality) and a potential loss of privacy.

The terminology used to describe deep learning methods (neurons, synapses) reveals the ambition to capture at least some of the brain functionalities via artificial means. But the contrast between the apparent efficiency of the human brain, which operates with five orders of magnitude (100,000 times) less power than the most powerful current supercomputers, and the state of the art on neural networks remains jarring.

Current deep learning methods rely on second-generation neurons, which consist of simple static non-linear functions. In contrast, neurons in the human brain are known to communicate by means of sparse spiking processes. As a result, neurons are mostly inactive and energy is consumed sporadically and only in limited areas of the brain at any given time. Third-generation neural networks, or Spiking Neural Networks (SNNs), aim at harnessing the efficiencies of spike-domain processing by building on computing elements that operate on, and exchange, spikes. In an SNN, spiking neurons determine whether to output a spike to the connected neurons based on the incoming spikes.
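
The behavior of a spiking neuron can be sketched with a discrete-time leaky integrate-and-fire model, one common simplified neuron model and not necessarily the one used by any specific SNN or chip. The function name `lif_neuron` and the values of `weight`, `leak` and `threshold` are illustrative assumptions.

```python
def lif_neuron(input_spikes, weight=0.6, leak=0.9, threshold=1.0):
    """Simulate a discrete-time leaky integrate-and-fire neuron.

    input_spikes: sequence of 0/1 values, one per time step.
    Returns the list of output spikes (0/1), one per time step.
    """
    potential = 0.0  # internal analog state (membrane potential)
    output = []
    for spike in input_spikes:
        # The potential decays (leaks) and integrates the weighted input spike.
        potential = leak * potential + weight * spike
        if potential >= threshold:
            output.append(1)   # emit a spike to the connected neurons
            potential = 0.0    # reset after firing
        else:
            output.append(0)
    return output

print(lif_neuron([1, 1, 0, 0, 1, 0, 0, 0]))
```

With these parameters the neuron fires only when two input spikes arrive close together in time, and it stays silent otherwise, which illustrates why activity, and hence energy consumption, can be sparse.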

Neuromorphic hardware that is able to natively implement SNNs is currently being developed. Unlike in traditional CPUs or GPUs running deep learning algorithms, processing and communication are not “clocked” to take place across all computing elements at regular intervals. Rather, neuromorphic hardware consists of spiking neurons that are only active in an asynchronous manner whenever excited by input spikes, potentially increasing the energy efficiency by orders of magnitude.

If the promises of neuromorphic hardware and SNNs are realized and neuromorphic chips find their place within mobile devices, we could soon see the emergence of revolutionary new applications under enhanced privacy guarantees.