Machine Learning, AI and Big Data Tools Open-Sourced By Major Corporations
The goal of this article is to provide an overview of Machine Learning and Artificial Intelligence frameworks released by large corporations. We focus not just on pure Machine Learning and AI tools but also include some Big Data frameworks that help make Machine Learning and AI available at scale. While these releases certainly serve strategic business purposes, there is no doubt that the trend of open-sourcing internal tools is adding value and making Machine Learning and AI more accessible.
Over the past two to three years a large number of frameworks have been open-sourced. Companies may wish to establish standards, showcase their advanced level of research, attract talent or leverage the power of a community when open-sourcing tools. Whatever the reasons may be, large organizations tend to have extensive resources, which they use to build their internal tools. For businesses interested in exploring Data Science, it only makes sense to evaluate whether the effort that has already gone into building these frameworks can be leveraged. We provide a summary of released tools, but not a comparison of the individual frameworks. Especially when it comes to Deep Learning, entire communities have formed around tools, with very dedicated fans and opponents. While we avoid such discussions, we provide our own observations and conclude the article with some generic guidelines for evaluating frameworks for business use.
The following is a list of released tools by company. In some instances the lists may not be complete but rather narrowed to tools we found to be relevant.
Google released TensorFlow in 2015. Described as a Machine Intelligence framework, it was originally developed for Deep Neural Networks. While it is primarily written in C++ and Python, it provides APIs for multiple languages. TensorFlow can run in a parallel mode and leverage graphics processing units (GPUs). According to Google, TensorFlow has been applied internally to a variety of tasks such as Image Captioning, Search, OCR, and Language Translation. In 2014 Google acquired DeepMind, a startup focused on Deep Learning and Artificial General Intelligence. DeepMind is known for building AlphaGo, the system that defeated top human players at the game of Go, and for reducing the power consumption across Google’s data centers by applying Artificial Intelligence. DeepMind has moved their entire infrastructure to run on TensorFlow. In addition, they have released Sonnet, their own tool for constructing Deep Neural Networks on top of TensorFlow. DeepMind has also open-sourced a framework called DeepMind Lab, a game-like environment tailored for agent-based learning.
- TensorFlow: Deep Learning Framework
- Sonnet: Framework to Construct Neural Networks on top of TensorFlow
- DeepMind Lab: Testbed for Agent-based Research
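The core idea behind TensorFlow, building a dataflow graph of operations first and executing it later, can be illustrated with a minimal pure-Python sketch. This is a conceptual illustration only, not TensorFlow’s actual API:

```python
# Conceptual sketch of a dataflow graph: operations are recorded as nodes
# and only evaluated when the graph is run (deferred execution).
class Node:
    def __init__(self, op, inputs=()):
        self.op = op          # callable producing this node's value
        self.inputs = inputs  # upstream nodes

    def run(self, feed):
        # Evaluate upstream nodes recursively, then apply this node's op.
        args = [n.run(feed) for n in self.inputs]
        return self.op(feed, *args)

def placeholder(name):
    return Node(lambda feed: feed[name])

def add(a, b):
    return Node(lambda feed, x, y: x + y, (a, b))

def mul(a, b):
    return Node(lambda feed, x, y: x * y, (a, b))

# Build the graph y = (x1 + x2) * x3 without computing anything yet.
x1, x2, x3 = placeholder("x1"), placeholder("x2"), placeholder("x3")
y = mul(add(x1, x2), x3)

# Computation happens only when the graph is run with concrete values,
# which is what lets a framework optimize and distribute the execution.
result = y.run({"x1": 2, "x2": 3, "x3": 4})
print(result)  # 20
```

Deferring execution in this way is what allows frameworks like TensorFlow to parallelize the graph and place parts of it on GPUs.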
Microsoft released CNTK (Computational Network Toolkit) in 2016; it was later renamed the Microsoft Cognitive Toolkit. The toolkit is a Deep Learning framework written in C++ with a Python interface. It is used internally by Microsoft for various services such as Cortana, Skype, Bing and Xbox. Microsoft claims it is significantly faster than TensorFlow in some benchmarks. In addition to the Deep Learning framework, Microsoft has released the Distributed Machine Learning Toolkit (DMTK), a framework to facilitate Machine Learning at scale.
- Microsoft Cognitive Toolkit: A Framework for Deep Learning
- Distributed Machine Learning Toolkit (DMTK): A Framework to Provide Machine Learning at Scale
The Facebook AI Research group (FAIR) has been one of the most active in publishing and open-sourcing projects in recent years. Its Director of AI Research is Yann LeCun, a well-known and respected researcher in the Machine Learning community who has made significant contributions to the field. Facebook has released a Deep Learning module for Torch, and recently released Caffe2 in conjunction with NVIDIA. Both are frameworks for Deep Learning. Facebook has also released a Computer Vision pipeline consisting of the DeepMask, SharpMask and MultiPathNet projects, which deal with image segmentation, object detection and classification. The pipeline can be used to train a model to interpret images. In addition, Facebook has released projects that deal with text and language understanding. FastText can be used to generate vector representations of words, perform text classification and run efficient sentiment analysis, among other things. Facebook also contributed a Question and Answer system called bAbI.
- DeepMask, SharpMask, MultiPathNet: Computer Vision Pipeline to Segment, Detect and Classify Objects in Images
- FastText: Framework for Generating Word Representations and Text Classification
- bAbI: Question and Answer System
- Torch Deep Learning Module
- Caffe2: Deep Learning Framework
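The idea behind fastText-style text classification, averaging word vectors into a sentence vector and scoring it against per-class weight vectors, can be sketched in a few lines. The toy vectors below are hand-picked for illustration; the real fastText learns them from data, and this is not its actual API:

```python
# Conceptual sketch of fastText-style classification: a sentence vector
# is the average of its word vectors, scored against class vectors.
WORD_VECTORS = {
    "great":    [0.9, 0.1],
    "terrible": [0.1, 0.9],
    "movie":    [0.5, 0.5],
}
CLASS_VECTORS = {"positive": [1.0, 0.0], "negative": [0.0, 1.0]}

def sentence_vector(tokens):
    vecs = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    # Average each dimension across all word vectors.
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def classify(tokens):
    sv = sentence_vector(tokens)
    # Dot-product score of the sentence vector against each class vector.
    scores = {label: sum(a * b for a, b in zip(sv, cv))
              for label, cv in CLASS_VECTORS.items()}
    return max(scores, key=scores.get)

print(classify("great movie".split()))     # positive
print(classify("terrible movie".split()))  # negative
```

Part of what makes this representation attractive is that it stays fast at scale: classification reduces to vector averaging and a handful of dot products.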
Baidu has also been very active when it comes to Machine Learning. Until recently, Andrew Ng, a well-known researcher, worked for Baidu as its Chief Scientist. Baidu has also opened a research lab in Silicon Valley and employs a large number of scientists. In 2016 Baidu open-sourced its PaddlePaddle (PArallel Distributed Deep LEarning) framework. Baidu claims it is easier to use than the other Deep Learning frameworks recently released by other organizations. According to Baidu, PaddlePaddle is used for a range of internal services including click-through rate prediction, image classification, optical character recognition (OCR), search and recommendations. Baidu has also announced that it will open-source its self-driving technology, named Apollo.
Samsung has decided to deploy an AI-based assistant called Bixby on its Galaxy S8 phones. It seems that they have committed significant resources to compete with the likes of Siri, Google Assistant, Alexa and Cortana. As far as open-sourced software goes, Samsung has released a Deep Learning framework of its own, called Veles. It is a distributed framework written in Python and provides a repository of pre-built models (VelesForge).
In 2016 Amazon released the Deep Scalable Sparse Tensor Network Engine (DSSTNE), its own take on a Deep Learning framework. Unlike most other Deep Learning frameworks, DSSTNE was developed with product recommendations as its main use case. According to Amazon, it is used to provide personalized product recommendations across its site. While DSSTNE may not be as popular among developers as TensorFlow, Amazon has considerable experience and was applying Machine Learning methods long before Deep Neural Networks were hyped. Alex Smola, a respected Machine Learning researcher, has been hired by Amazon to lead its Machine Learning efforts.
- Deep Scalable Sparse Tensor Network Engine (DSSTNE): Deep Neural Network Framework with Emphasis on Recommendations
Huawei has been actively building out Machine Learning capabilities and services through its Noah’s Ark Lab. They utilize the technology internally and offer it to customers as well. While they have developed a collection of solutions, it appears the only one currently open-sourced is streamDM, a stream processing framework on top of Apache Spark.
IBM open-sourced SystemML in 2015. It is currently an Apache incubator project. SystemML grew out of the effort IBM put into developing Watson. It introduces a high-level language called the Declarative Machine Learning language (DML), which is used to specify Machine Learning algorithms. The intent of the language is to help Data Scientists work more efficiently. Beyond providing DML, SystemML also performs automated optimization of code execution. SystemML is tightly integrated with Apache Spark and comes with a collection of ready-to-use algorithms.
- Apache SystemML: A Framework that Improves Efficiency of Data Scientists by Providing a Declarative Machine Learning Language and Automated Optimization
Salesforce has placed an emphasis on AI and calls its internal AI framework Einstein. While Einstein is not an open source project, Salesforce has donated a project by the name of PredictionIO to the Apache Foundation. Salesforce acquired PredictionIO in 2016. It is a framework for hosting and quickly deploying Machine Learning services. PredictionIO is currently an Apache incubator project and uses Apache Spark for processing.
- Apache PredictionIO: A Framework that Provides the Capability to Deploy Machine Learning Algorithms as a Service
In 2015 Airbnb released Aerosolve, a Machine Learning framework they use internally to determine which factors to incorporate into their dynamic pricing model. The framework is promoted as being interpretable by humans, in that it is easy to understand the effects of individual factors on pricing and subsequently tune them. This is not a general-purpose framework but rather one tailored for pricing use cases. Another very interesting open-sourced tool from Airbnb is Superset, a data exploration framework that allows data to be visualized in dashboards, queried and sliced. In addition, Airbnb has released Airflow, an Apache Incubator project which allows workflows to be programmatically created, scheduled and monitored. It is written in Python. Lastly, Airbnb released StreamAlert, a real-time data analysis and alerting framework.
- Aerosolve: A Humanly Interpretable Framework for Tuning the Factors of a Dynamic Pricing Algorithm
- Superset: A Data Exploration Framework which Allows Easy Querying, Slicing, Visualization and Dash-boarding of Data
- Apache Airflow: A Workflow Management System
- StreamAlert: A Framework for Real-time Data Analysis and Alerting
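The core concept Airflow implements, a workflow as a directed acyclic graph of tasks executed in dependency order, can be sketched in pure Python. This is the underlying scheduling idea only, not Airflow’s real API:

```python
# Conceptual sketch of a workflow DAG: each task runs only after all of
# its upstream dependencies have completed.
def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: [upstream names]}."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)          # run all upstream tasks first
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

log = []
tasks = {
    "extract":   lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load":      lambda: log.append("load"),
}
deps = {"transform": ["extract"], "load": ["transform"]}
order = run_dag(tasks, deps)
print(order)  # ['extract', 'transform', 'load']
```

Airflow adds scheduling, retries, monitoring and a UI on top of this basic idea, but the dependency-ordered execution is the heart of it.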
In 2016 Uber acquired Geometric Intelligence, and formed the Uber AI Labs. While Uber seems to be heavily using Machine Learning for internal purposes, they have not open-sourced any of those algorithms yet. However we have added them to this list because recently they did release Deck.GL 4.0, a visualization framework for large data sets. While the primary purpose of the visualization framework appears to be geospatial exploration, it has been extended to other types of visualizations relevant to Machine Learning. Frameworks that can visualize large quantities of data in a meaningful manner are rare and valuable.
LinkedIn has made some very significant contributions to the open source community over the past years. Two of the best known are Apache Kafka and Apache Samza. Kafka is a distributed streaming platform, while Samza is used for processing streams on top of Kafka. These two platforms have played a significant role in making real-time analytics more accessible. When it comes to pure Machine Learning frameworks, LinkedIn released Photon ML in 2016. It is a framework built on top of Apache Spark, and it is refreshing in that it is not yet another Deep Learning framework. It contains a number of Generalized Linear Models, and the intent is to move towards Generalized Additive Mixed Effect Models. In simple terms, the algorithms in this framework are scalable and can easily be repurposed for providing all sorts of recommendations, including product recommendations. Beyond Photon ML, notable contributions include Apache DataFu, a collection of libraries for working with large data sets on Hadoop, and Azkaban, a workflow job scheduler for Hadoop.
- Photon ML: A Spark-Based Machine Learning Framework with a Number of Scalable and Useful Algorithms
- Apache Kafka and Apache Samza: Streaming and Stream Processing Frameworks
- Apache DataFu : Collection of Libraries for Working with Large Data Sets on Hadoop.
- Azkaban: Batch Workflow Scheduler for Hadoop
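The abstraction at the heart of Kafka is an append-only log that producers write to, with each consumer tracking its own read offset. A minimal sketch of that idea, not the real Kafka client API:

```python
# Conceptual sketch of a Kafka-style commit log: records are appended in
# order, and each consumer advances its own offset independently.
class Log:
    def __init__(self):
        self.records = []

    def produce(self, record):
        self.records.append(record)          # append-only write
        return len(self.records) - 1         # offset of the new record

class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0                      # each consumer owns its offset

    def poll(self):
        batch = self.log.records[self.offset:]
        self.offset = len(self.log.records)  # commit the new offset
        return batch

topic = Log()
consumer = Consumer(topic)
topic.produce("click")
topic.produce("view")
first_batch = consumer.poll()
print(first_batch)   # ['click', 'view']
topic.produce("purchase")
second_batch = consumer.poll()
print(second_batch)  # ['purchase']
```

Because offsets belong to consumers rather than the log, many independent consumers can read the same stream at their own pace, which is what makes this model so useful for real-time analytics.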
Twitter has been very active in the open source community, not just when it comes to Machine Learning but in software development in general. Twitter contributed an efficient algorithm for computing thresholded all-pairs similarity to the Apache Spark MLlib library. They have also released Summingbird, a framework that allows MapReduce-style programs to be executed on Scalding (batch) or Storm (streaming). Twitter also released Heron, a real-time analytics framework. It is a successor to Apache Storm which introduces improvements while providing backwards compatibility. In addition, Twitter open-sourced an R package called AnomalyDetection and made several contributions to Torch. For example, Torch IPC is an open-source module enabling parallel high-performance computing on Torch.
- Apache Spark MLlib: All-Pairs-Similarity Implemented in RowMatrix
- Summingbird: A Framework Allowing MapReduce-Style Jobs to Run on Scalding (Batch) and Storm (Streaming)
- Heron: A Real-Time Analytics Framework
- AnomalyDetection: R-based Package for Anomaly Detection
- Torch Twrl, Torch IPC, Torch Distlearn: Various Contributions to Torch
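As a much simplified stand-in for what anomaly-detection packages do, consider flagging points that deviate too far from the mean. Twitter’s AnomalyDetection package actually uses a seasonal hybrid ESD test, which handles seasonality and trends; the z-score threshold below is only a sketch of the general idea:

```python
import statistics

# Simplified anomaly detection: flag points more than `k` standard
# deviations from the mean. (Real packages use more robust tests that
# account for seasonality and trend.)
def find_anomalies(series, k=2.0):
    mean = statistics.mean(series)
    stdev = statistics.stdev(series)
    return [i for i, x in enumerate(series)
            if abs(x - mean) > k * stdev]

series = [10, 11, 9, 10, 12, 10, 50, 11, 10]
anomalies = find_anomalies(series)
print(anomalies)  # [6] -- the index of the outlier value 50
```

The weakness of this naive approach, that a single large outlier inflates the standard deviation and masks smaller anomalies, is exactly why production tools rely on more robust statistics.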
OpenAI (Non-profit worth mentioning)
OpenAI is not a large corporation, but it is worth mentioning. Elon Musk is one of its co-founders, and it has significant financial resources at its disposal. OpenAI’s mission is to advance research into safe Artificial General Intelligence. They have open-sourced a testbed for Reinforcement Learning called OpenAI Gym, and a system called Universe, consisting of multiple Gym environments, with the purpose of evaluating the general intelligence of an agent. While Gym may appear similar in purpose to the framework released by DeepMind, OpenAI does not seem to be as narrowly focused on Deep Learning. OpenAI engages in a breadth of research topics. It is an organization worth following.
- OpenAI Gym: An Agent-based Environment for Evaluating Reinforcement Learning Algorithms
- OpenAI Universe: An Environment for Evaluating the General Intelligence of an Agent
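Gym standardizes the agent-environment interaction behind a small interface: `reset()` returns an initial observation, and `step(action)` returns an observation, a reward, a done flag and an info dict. The sketch below mimics that interface with a hypothetical toy environment rather than using the real library, where environments are obtained via `gym.make()`:

```python
import random

# A toy environment mimicking Gym's classic reset()/step() interface.
class ToyEnv:
    """Reach position 3 on a number line; reward 1.0 on success."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):          # action: -1 (left) or +1 (right)
        self.pos += action
        done = self.pos == 3
        reward = 1.0 if done else 0.0
        return self.pos, reward, done, {}

# A random agent interacting with the environment for one episode.
random.seed(0)
env = ToyEnv()
obs = env.reset()
total_reward = 0.0
for _ in range(100):
    action = random.choice([-1, 1])
    obs, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        break
print(total_reward)
```

Because every environment exposes the same interface, the same agent code can be benchmarked across many tasks, which is precisely what makes Gym useful as a testbed.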
In terms of specific methods, Deep Learning is the most common focus, with six companies (Google, Microsoft, Facebook, Amazon, Samsung, Baidu) releasing frameworks. This is not coincidental, since all of these organizations have internal applications that can benefit from scalable Deep Neural Networks. Typically they provide cloud-based Machine Learning offerings (AWS, Google, Azure), voice-based assistants (Cortana, Google Assistant, Alexa, Bixby), image processing (Google, Baidu, Microsoft) or search (Google, Microsoft, Baidu). The emphasis on Deep Neural Networks is having a dual effect. On one side it is reinforcing the hype surrounding Deep Learning and the misconceptions that come with that hype. On the other side it is providing use cases, drawing attention to Deep Learning, and helping fuel new research in the field.
The added attention to Deep Learning and increased public awareness are helping to generate new funding for Machine Learning research. That in turn is likely to lead to more advancements. It is true that countless businesses are jumping on the bandwagon without really understanding what needs to be done. They tend to be under the impression that Deep Learning is like an artificial human brain, a thing of the future. On the other end of the spectrum are those who may well be knowledgeable, but dismiss the recent developments as a re-branding of Neural Networks without studying Deep Learning in detail.
The truth lies somewhere in-between. Advances in large-scale computing have opened up the door to utilizing Neural Networks that are wider and deeper than ever before. Beyond the computational aspects, advances in training and applying Neural Networks have steadily been made. A quick search of Deep Learning literature reveals a wealth of publications. These advancements might not be revolutionary, but they are certainly non-negligible. Finding new application areas for Deep Neural Networks is also valuable progress. Beyond that it is very exciting to learn how far the envelope can be pushed when it comes to Deep Neural Networks. We expect the added attention to help drive new research. Deep Neural Networks as they are used nowadays are definitely the product of ongoing research, and not just a re-branded relic from the past.
Another frequent misconception is that Deep Learning is equivalent to Deep Neural Networks. Some go even further and equate Artificial Intelligence to Deep Neural Networks. Currently Deep Neural Networks are the most prominent use case of Deep Learning. However, the concept of Deep Learning is much broader. It also encompasses probabilistic methods which use hidden (or latent) representations of data across multiple levels. If we think of Deep Learning in this way, it is a very broad and powerful concept that includes a wide variety of Machine Learning models. Some of the released Deep Learning frameworks are also generic in this sense, and can be used for models which are not Neural Networks. While Deep Learning is a powerful concept, it should also be noted that both Machine Learning and Artificial Intelligence are very broad fields and are not limited to Deep Learning.
Overall, any release of a new framework is an exciting development. Based on their activity, Facebook, LinkedIn, Twitter and Airbnb currently seem to be the most active in contributing open-source software of use to Machine Learning. Looking at the git repositories, we can even find code for specific publications in some cases. This openness is to be commended, and we certainly hope the trend will continue. There have been announcements that Apple will also start to open-source its technology. Beyond that, we are awaiting Baidu’s announced release of its self-driving technology.
There is the Cutting Edge, and Then There is the Other Edge…
All of the released frameworks provide value in some way. However, their applicability and potential value for business use differ depending on the nature of the business and its Data Science maturity. Organizations heavily involved in Machine Learning R&D tend to have in-house expertise and are likely to be better positioned to evaluate any frameworks that are released.
Most of the organizations releasing Machine Learning frameworks are rather mature when it comes to Data Science. Their executives emphasize the value of Machine Learning and Artificial Intelligence. Whether it is Jeff Bezos, Elon Musk, Mark Zuckerberg or Satya Nadella, the mandate to exploit Machine Learning and Artificial Intelligence comes from the very top of the organization. Specialized executives with both a business background and a Data Science background are hired to help execute a strategy.
Organizations that are just in the initial stages of applying Data Science are more likely to pursue frameworks without truly understanding their business value or how to systematically evaluate them. The following are four types of organizational issues that can lead to problems in practice:
1. The Bandwagon Ride
Most businesses don’t have a focus on Machine Learning and Data Science that is mandated from the very top. Even if they do, their maturity level with respect to Data Science tends to be much lower. They tend to be aware that there is potential value to be gained, but they lack the management structure or experience to capture it. Larger organizations can have a chain of management more than five levels deep. Even if Data Scientists have been hired at the bottom level, problems can occur when there is no pathway to the strategic decision maker, and the company might end up chasing a bandwagon rather than executing a carefully planned strategy. This is especially likely when nobody between the lower-level managers and the top-level executives has a Data Science background. While executives certainly do not have to be experts on everything, problems arise when decisions on Data Science strategy are made in this vacuum of expertise.
2. The All-Purpose Hammer
As organizations start applying Data Science, it usually happens within existing structures. Departments that have a need start applying methods. Once the realization kicks in that this may be something of high value to the organization, management structures are set up. Organizations may be familiar with Software Development, so there is a tendency to re-purpose the same management structures for Data Science, in the hope that prior experience with Software Development will pay off. The only problem is that Data Science is not Software Development. Someone with a technical background is more likely to learn about Data Science; however, applying Software Development methodologies unchanged to Data Science can be problematic.
3. The Fog of Delegation
Another frequent pitfall in executing a Data Science strategy is unclear delegation of responsibility. A strategy at the top level is necessary, but it is just as important to have a very clear separation of responsibilities all the way down to the individual Data Scientist. If it is not clear who owns what, especially in larger organizations, competing views and plans emerge with no clear chain of command. Departments or entire teams end up competing with each other and duplicating efforts in a way that is not healthy for the organization as a whole. With no clear separation of responsibilities, problems usually persist until a necessary reorganization is executed.
4. The Missing Link
The missing link in the chain of management is, for us, a gap of knowledge that hurts the execution of a strategy. The typical scenario unfolds as follows: an executive with a pure business background hires a Machine Learning expert to advise them. The thinking is that the business expert takes care of the business strategy, while the Machine Learning expert answers questions about Machine Learning. But problems arise when the executive has no background in Data Science and the Data Science expert has no background in business. There is nobody to bridge the two worlds, and that is the missing link. The advice that is really needed is strategic in nature, on how to execute a plan, not on the merits of various methods or frameworks. This problem of a missing link can occur at any level in the chain of management. The likelihood of success is enormously increased if experts with overlapping backgrounds are used to bridge the gap in understanding. A Chief Data Scientist should be an executive who can communicate with executives and Data Scientists alike. The further down the management chain one goes, the more the mix of expertise can be tilted towards Data Science. This overlap in expertise along the management chain helps ensure that strategic goals are informed by Data Science, and that the strategy itself is executed properly.
Given these organizational challenges, companies with a low level of Data Science maturity can have a hard time quantifying the usability and value of any framework that is released. Instead, the releases end up serving more as a way to generate a perception of technological superiority. To risk-averse executives, that perception may make a difference in deciding whether to engage with one vendor or another. It gets really interesting when those commitments are made purely on perception and marketing, devoid of any expertise or applicability to their own business.
The point is, while the released frameworks are valuable, whether that value can really be leveraged ends up being a complicated question, one which in many cases is driven by political and organizational issues rather than pure Data Science factors.
Guidelines For Businesses with Low-Level Data Science Maturity
These are some useful questions to ask when considering an open-sourced framework.
1. How Much Research is Required to Adopt the Framework?
This might be the framework everybody is talking about. Company X is using it for everything from search to translation to recommendations. But what does the framework actually provide? Do you have to hire a team of Data Scientists to create capabilities on top of the framework? The first step for any company that is starting to apply Data Science internally should be to figure out what can be done with available solutions. Research and development in Machine Learning is expensive; it can be a black hole for resources, and unnecessary on top of that. Even if you plan to pursue custom methods, it is necessary to establish a baseline for comparison.
2. Does the Framework Come with Ready-to-use Algorithms?
Developing algorithms and testing them at scale is very expensive. A framework that contains a large number of usable algorithms is valuable, even if it does not attract as much hype. The question is: what can you do out of the box? Are there ready-to-use algorithms if you subscribe to the vendor?
3. Is There a Community Around the Framework?
Some frameworks come with readily usable algorithms, others don’t. However some companies actively try to build a community around their framework. Google has done a good job on this front when it comes to TensorFlow. Even though there might not be a large number of provided models, there is a growing community, and freely available models can be found.
4. What are the Alternatives?
Prior to committing to any framework, all options should be taken into consideration. In this article we have talked about major corporations, but they are not the only source of high-quality open source software. Good open source frameworks have been contributed by various smaller organizations and startups as well. Beyond that, there is also the Apache Foundation.
5. How Long Does it Take to Execute a Proof of Concept?
Provided that you have picked a framework with ready-to-use algorithms, executing a proof of concept should not take long. Thinking through a proof of concept may also reveal other weaknesses within your organization or hidden challenges in adopting a framework.