Data scientists are inquisitive, always seeking new tools that help them find answers. They also need to be proficient in the tools of the trade, even though there are dozens upon dozens of them. Overall, data scientists should have a working knowledge of statistical programming languages for constructing data processing systems, along with databases and visualization tools. Many in the field also consider programming knowledge an integral part of data science; however, not all data science students study programming, so it helps to be aware of tools that sidestep programming with a user-friendly graphical interface, so that a data scientist's knowledge of algorithms alone is enough to build predictive models.
With everything on your plate as a data scientist, you don't have time to hunt for the tools of the trade that can help you do your work. That's why we have rounded up tools that aid in data visualization, algorithms, statistical programming languages, and databases. We chose tools based on their ease of use, popularity, reputation, and features, and we list them alphabetically to simplify your search; they are not ranked or rated.
1. Algorithms.io

Algorithms.io is a LumenData company providing machine learning as a service for streaming data from connected devices. This tool turns raw data into real-time insights and actionable events so that companies are in a better position to deploy machine learning for streaming data.
- Simplifies the process of making machine learning accessible to companies and developers working with connected devices
- Cloud platform addresses the common challenges with infrastructure, scale, and security that arise when deploying machine data
- Creates a set of APIs for developers to use to integrate machine learning into web and mobile apps so that any application can turn raw streaming data into intelligent output
Cost: Contact for a quote
2. Apache Giraph
An iterative graph processing system designed for high scalability, Apache Giraph began as an open source counterpart to Pregel but adds multiple features beyond the basic Pregel model. Giraph is used by data scientists to “unleash the potential of structured datasets at a massive scale.”
- Inspired by the Bulk Synchronous Parallel model of distributed computation as introduced by Leslie Valiant
- Master computation
- Sharded aggregators
- Edge-oriented input
- Out-of-core computation
- Steady development cycle and growing community of users
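The vertex-centric, superstep-based model Giraph inherits from Pregel can be sketched in plain Python. This toy maximum-value propagation is an illustration of the Bulk Synchronous Parallel idea only, not the Giraph API:

```python
# Plain-Python sketch of the Pregel-style, vertex-centric model Giraph
# implements (not the Giraph API). Each superstep, every vertex processes
# its incoming messages, optionally updates its value, and messages its
# neighbors; the loop boundary plays the role of the synchronization barrier.

def max_value_bsp(graph, values):
    """graph: {vertex: [neighbors]}; values: {vertex: number}.
    Propagates the maximum value through each connected component."""
    # Superstep 0: every vertex announces its value to its neighbors.
    messages = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            messages[n].append(values[v])
    while any(messages.values()):          # run until all vertices are idle
        next_messages = {v: [] for v in graph}
        for v, incoming in messages.items():
            if incoming and max(incoming) > values[v]:
                values[v] = max(incoming)  # adopt a larger value...
                for n in graph[v]:         # ...and tell the neighbors
                    next_messages[n].append(values[v])
        messages = next_messages           # barrier: next superstep
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(max_value_bsp(graph, {"a": 3, "b": 6, "c": 2}))  # {'a': 6, 'b': 6, 'c': 6}
```

Giraph runs the same pattern across a cluster, with messages shipped between workers instead of Python lists.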
3. Apache Hadoop
Apache Hadoop is open source software for reliable, distributed, scalable computing. A framework allowing for the distributed processing of large datasets across clusters of computers, the software library uses simple programming models. Hadoop is appropriate for research and production.
- Designed to scale from single servers to thousands of machines
- The library detects and handles failures at the application layer instead of relying on hardware to deliver high-availability
- Includes the Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce modules
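The MapReduce programming model that Hadoop's MapReduce module distributes across a cluster can be sketched in-memory. This toy word count illustrates the model only; it runs on one machine and is not Hadoop itself:

```python
from collections import defaultdict

# In-memory sketch of the MapReduce model (illustration, not Hadoop):
# map emits key/value pairs, the framework shuffles them by key, and
# reduce aggregates each group of values.

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)                 # mapper output: (key, value)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)           # group values by key
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data big clusters", "big data"])))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

On Hadoop, the map and reduce functions are the same shape, but the shuffle happens over HDFS and the network.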
4. Apache HBase
The Hadoop database, Apache HBase is a distributed, scalable, big data store. Data scientists use this open source tool when they need random, real-time read/write access to Big Data. Apache HBase also provides capabilities similar to Bigtable on top of Hadoop and HDFS.
- Open source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data
- Linear and modular scalability
- Strictly consistent reads and writes
- Automatic and configurable sharding of tables
5. Apache Hive
An Apache Software Foundation project, Apache Hive began as a subproject of Apache Hadoop and now is a top-level project itself. This tool is data warehouse software that assists in reading, writing, and managing large datasets that reside in distributed storage using SQL.
- Projects structure onto data already in storage
- Command line tool is provided to connect users to Hive
- JDBC driver is provided to connect users to Hive
6. Apache Kafka
A distributed streaming platform, Apache Kafka efficiently processes streams of data in real time. Data scientists use this tool to build real-time data pipelines and streaming apps because it empowers you to publish and subscribe to streams of records, store streams of records in a fault-tolerant way, and process streams of records as they occur.
- Runs as a cluster on one or more servers
- Cluster stores streams of records in categories called topics
- Each record includes a key, value, and timestamp
- Has four core APIs: Producer API, Consumer API, Streams API, and Connector API
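The record-and-topic model in the bullets above can be sketched in plain Python. This is an illustration of Kafka's data model only, not the Kafka client API:

```python
import time
from collections import defaultdict, namedtuple

# In-memory sketch of Kafka's data model (not the Kafka client API):
# a topic is an append-only log of records, each holding a key, a value,
# and a timestamp; consumers track their own read offset into the log.

Record = namedtuple("Record", ["key", "value", "timestamp"])

class TopicLog:
    def __init__(self):
        self.records = []

    def produce(self, key, value):
        self.records.append(Record(key, value, time.time()))

    def consume(self, offset=0):
        return self.records[offset:]       # records at and after the offset

topics = defaultdict(TopicLog)             # the "cluster": topics by name
topics["sensor-readings"].produce("device-1", 21.5)
topics["sensor-readings"].produce("device-2", 19.8)
batch = topics["sensor-readings"].consume(offset=0)
print([(r.key, r.value) for r in batch])  # [('device-1', 21.5), ('device-2', 19.8)]
```

Real Kafka partitions each topic's log across brokers and replicates it for fault tolerance; the offset-based consumption works the same way.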
7. Apache Mahout
An open source Apache Foundation project for machine learning, Apache Mahout aims to enable scalable machine learning and data mining. Specifically, the project’s goal is to “build an environment for quickly creating scalable performant machine learning applications.”
- Simple, extensible programming environment and framework for building scalable algorithms
- Includes a wide variety of pre-made algorithms for Scala + Apache Spark, H2O, and Apache Flink
- Provides Samsara, a vector math experimentation environment with R-like syntax, which works at scale
8. Apache Mesos
A cluster manager, Apache Mesos provides efficient resource isolation and sharing across distributed applications or frameworks. Mesos abstracts CPU, memory, storage, and other resources away from physical or virtual machines to enable fault-tolerant, elastic distributed systems to be built easily and run effectively.
- Built using principles similar to that of the Linux kernel but at a different level of abstraction
- Runs on every machine and provides applications like Hadoop and Spark with APIs for resource management and scheduling across entire datacenter and cloud environments
- Easily scales to 10,000s of nodes
- Non-disruptive upgrades for high availability
- Cross platform and cloud provider agnostic
9. Apache Pig
A platform designed for analyzing large datasets, Apache Pig consists of a high-level language for expressing data analysis programs that is coupled with infrastructure for evaluating such programs. Because Pig programs’ structures can handle significant parallelization, they can tackle large datasets.
- Infrastructure consists of a compiler capable of producing sequences of Map-Reduce programs for which large-scale parallel implementations already exist
- Language layer includes a textual language called Pig Latin
- Key properties of Pig Latin include ease of programming, optimization opportunities, and extensibility
10. Apache Spark
Apache Spark delivers “lightning-fast cluster computing.” A wide range of organizations use Spark to process large datasets, and this data scientist tool can access diverse data sources such as HDFS, Cassandra, HBase, and S3.
- Advanced DAG execution engine to support acyclic data flow and in-memory computing
- More than 80 high-level operators make it simple to build parallel apps
- Use interactively from the Scala, Python, and R shells
- Powers a stack of libraries including SQL, DataFrames, MLlib, GraphX, and Spark Streaming
11. Apache Storm
Apache Storm is a tool for data scientists that handles distributed and fault-tolerant real-time computation. It also tackles stream processing, continuous computation, distributed RPC, and more.
- Free and open source
- Reliably process unbounded streams of data in real time
- Use with any programming language
- Use cases include real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more
- More than one million tuples processed per second per node
- Integrates with your existing queueing and database technologies
12. BigML

BigML makes machine learning simple. This company-wide platform runs in the cloud or on premises for operationalizing machine learning in organizations. BigML makes it simple to solve and automate classification, regression, cluster analysis, anomaly detection, association discovery, and topic modeling tasks.
- Build sophisticated machine learning-based solutions affordably
- Distill predictive patterns from data into practical, intelligent applications that anyone can use
- The platform, private deployments, and rich toolset help users create, rapidly experiment, fully automate, and manage machine learning workflows to power intelligent applications
Cost: Contact for a quote
13. Bokeh

A Python interactive visualization library, Bokeh targets modern web browsers for presentation and helps users create interactive plots, dashboards, and data apps easily.
- Provides elegant and concise construction of graphics similar to D3.js
- Extends capabilities to high-performance interactivity over large or streaming datasets
- Quickly and easily create interactive plots, dashboards, and data applications
14. Cascading

Cascading is an application development platform for data scientists building Big Data applications on Apache Hadoop. Users can solve simple and complex data problems with Cascading because it boasts a computation engine, a systems integration framework, and data processing and scheduling capabilities.
- Balances an ideal level of abstraction with appropriate degrees of freedom
- Offers Hadoop development teams portability
- Change a few lines of code and port Cascading to another supported compute fabric
- Runs on and may be ported between MapReduce, Apache Tez, and Apache Flink
15. Clojure

A robust and fast programming language, Clojure is a practical tool that marries the interactive development of a scripting language with an efficient infrastructure for multithreaded programming. Clojure is unique in that it is a compiled language but remains dynamic, with every feature supported at runtime.
- Rich set of immutable, persistent data structures
- Offers a software transactional memory system and reactive Agent system to ensure clean, correct, multithreaded designs when mutable state is necessary
- Provides easy access to Java frameworks with optional type hints and type inference
- Dynamic environment that users can interact with
16. D3.js

D3.js is a JavaScript library for manipulating documents based on data, helping data scientists bring data to life using HTML, SVG, and CSS.
- Emphasis on web standards to gain full capabilities of modern browsers without being tied to a proprietary framework
- Combines powerful visualization components and a data-driven approach to Document Object Model (DOM) manipulation
- Bind arbitrary data to a DOM and then apply data-driven transformations to the document
17. DataRobot

An advanced machine learning automation platform, DataRobot helps data scientists build better predictive models faster. You can keep up with the ever-expanding ecosystem of machine learning algorithms easily when you use DataRobot.
- Constantly expanding, vast set of diverse, best-in-class algorithms from leading sources
- Train, test, and compare hundreds of varying models with one line of code or a single click
- Automatically identifies top pre-processing and feature engineering for each modeling technique
- Uses hundreds and even thousands of servers as well as multiple cores within each server to parallelize data exploration, model building, and hyper-parameter tuning
- Easy model deployment
Cost: Contact for a quote
18. DataRPM

DataRPM is the “industry’s first and only cognitive predictive maintenance platform for industrial IoT.” DataRPM also is the recipient of the 2017 Technology Leadership Award for Cognitive Predictive Maintenance in Automotive Manufacturing from Frost & Sullivan.
- Uses patent-pending meta-learning technology, an integral component of Artificial Intelligence, to automate predictions of asset failures
- Runs multiple live automated machine learning experiments on datasets
- Extracts data from every experiment, trains models on the metadata repository, applies models to predict the best algorithms, and builds machine-generated, human-verified machine learning models for predictive maintenance
- Workflow uses recipes such as feature engineering, segmentation, influencing factors, and prediction recipes to deliver prescriptive recommendations
Cost: Contact for a quote
19. Excel

Many data scientists view Excel as a secret weapon. It is a familiar tool that data scientists can rely on to quickly sort, filter, and work with their data. It’s also on nearly every computer you come across, so data scientists can work from just about anywhere with Excel.
- Named ranges for creating a makeshift database
- Sorting and filtering with one click to quickly and easily explore your dataset
- Use Advanced Filtering to filter your dataset based on criteria you specify in a different range
- Use pivot tables to cross-tabulate data and calculate counts, sums, and other metrics
- Visual Basic provides a variety of creative solutions
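The cross-tabulation a pivot table performs can be sketched in plain Python. The sales data below is illustrative only:

```python
from collections import defaultdict

# A pivot-table-style cross-tabulation in plain Python: sum the "amount"
# column by (region, product), as Excel's pivot tables do in a few clicks.

sales = [
    {"region": "East", "product": "A", "amount": 100},
    {"region": "East", "product": "B", "amount": 50},
    {"region": "West", "product": "A", "amount": 75},
    {"region": "East", "product": "A", "amount": 25},
]

pivot = defaultdict(float)
for row in sales:
    pivot[(row["region"], row["product"])] += row["amount"]

print(pivot[("East", "A")])  # 125.0
```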
Cost: FREE trial available
- Home Buying Options
  - Office 365 Home: $99.99/year
  - Office 365 Personal: $69.99/year
  - Office Home & Student 2016 for PC: $149.99 one-time purchase
- Business Buying Options
  - Office 365 Business: $8.25/user/month with annual commitment
  - Office 365 Business Premium: $12.50/user/month with annual commitment
  - Office 365 Business Essentials: $5/user/month with annual commitment
20. Feature Labs
An end-to-end data science solution, Feature Labs develops and deploys intelligent products and services for your data. The team also works with data scientists to help you develop and deploy intelligent products, features, and services.
- Integrates with your data to help scientists, developers, analysts, managers, and executives
- Discover new insights and gain a better understanding of how your data forecasts the future of your business
- On-boarding sessions tailored to your data and use cases to help you get off to an efficient start
Cost: Contact for a quote
21. ForecastThis

ForecastThis is a tool for data scientists that automates predictive model selection. The company strives to make deep learning relevant for finance and economics by enabling investment managers, quantitative analysts, and data scientists to use their own data to generate robust forecasts and optimize complex future objectives.
- Simple API and spreadsheet plugins
- Uniquely robust global optimization algorithms
- Scales to challenges of nearly any shape or size
- Algorithms create plausible, interpretable models of market processes to lend credibility to any output and help you get inside the market more successfully
Cost: Contact for a quote
22. Fusion Tables
Google Fusion Tables is a cloud-based data management service that focuses on collaboration, ease-of-use, and visualizations. An experimental app, Fusion Tables is a data visualization web application tool for data scientists that empowers you to gather, visualize, and share data tables.
- Visualize bigger table data online
- Combine with other data on the web
- Make a map in minutes
- Search thousands of public Fusion Tables or millions of public tables from the web that you can import to Fusion Tables
- Import your own data and visualize it instantly
- Publish your visualization on other web properties
23. GAWK

GNU is an operating system that enables you to use a computer without software “that would trample your freedom.” The GNU Project created gawk, an awk utility that interprets a special-purpose programming language. Gawk empowers users to handle simple data-reformatting jobs with just a few lines of code.
- Search files for lines or other text units containing one or more patterns
- Data-driven rather than procedural
- Makes it easy to read and write programs
24. ggplot2

Hadley Wickham and Winston Chang developed ggplot2, a plotting system for R that is based on the grammar of graphics. With ggplot2, data scientists can avoid many of the hassles of plotting while maintaining the attractive parts of base and lattice graphics and producing complex multi-layered graphics easily.
- Create new types of graphics tailored to your needs
- Create graphics to help you understand your data
- Produce elegant graphics for data analysis
25. GraphLab Create
Data scientists and developers use GraphLab Create to build state-of-the-art data products via machine learning. This machine learning modeling tool helps users build intelligent applications end-to-end in Python.
- Simplifies development of machine learning models
- Incorporates automatic feature engineering, model selection, and machine learning visualizations specific to the application
- Identify and link records within or across data sources corresponding to the same real-world entities
- FREE one-year renewable subscription for academic use
26. IPython

IPython (Interactive Python) is a growing project with expanding language-agnostic components that provides a rich architecture for interactive computing. An open source tool for data scientists, IPython supports Python 2.7 and 3.3 or newer.
- A powerful interactive shell
- A kernel for Jupyter
- Support for interactive data visualization and use of GUI toolkits
- Load flexible, embeddable interpreters into your own projects
- Easy-to-use high performance parallel computing tools
27. Java

Java is a language with a broad user base that serves as a tool for data scientists creating products and frameworks involving distributed systems, data analysis, and machine learning. Java now is recognized as being just as important to data science as R and Python because it is robust, convenient, and scalable for data science applications.
- Easy to break down and understand
- Helps users be explicit about types of variables and data
- Well-developed suite of tools
- Develop and deploy applications on desktops and servers in addition to embedded environments
- Rich user interface, performance, versatility, portability, and security for modern applications
Cost: FREE trial available; Contact for commercial license cost
28. Jupyter

Jupyter provides multi-language interactive computing environments. Its Notebook, an open source web application, allows data scientists to create and share documents containing live code, equations, visualizations, and explanatory text.
- Uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and more
- Supports more than 40 programming languages including popular data science languages like Python, R, Julia, and Scala
- Share notebooks with others via email, Dropbox, GitHub, and the Jupyter Notebook Viewer
- Use interactive widgets to manipulate and visualize data in realtime
29. KNIME Analytics Platform
Thanks to its open platform, KNIME is a tool for navigating complex data freely. The KNIME Analytics Platform is a leading open solution for data-driven innovation to help data scientists uncover data’s hidden potential, mine for insights, and predict futures.
- Enterprise-grade, open source platform
- Deploy quickly and scale easily
- More than 1,000 modules
- Hundreds of ready-to-run examples
- Comprehensive range of integrated tools
- The widest choice of advanced algorithms available
30. Logical Glue
An award-winning white-box machine learning and artificial intelligence platform, Logical Glue increases productivity and profit for organizations. Data scientists choose this tool because it brings your insights to life for your audience.
- Visual narratives that bring insights to life
- Improve the communication and visualization of your insights more easily
- Access new techniques with Fuzzy Logic and Artificial Neural Networks
- Build the most accurate predictive models
- Know exactly which data is predictive
- Simple deployment and integration
Cost: Contact for a quote
31. MATLAB

A high-level language and interactive environment for numerical computation, visualization, and programming, MATLAB is a powerful tool for data scientists. MATLAB serves as the language of technical computing and is useful for math, graphics, and programming.
- Analyze data, develop algorithms, and create models
- Designed to be intuitive
- Combines a desktop environment for iterative analysis and design processes with a programming language capable of expressing matrix and array mathematics directly
- Interactive apps to see how different algorithms work with your data
- Automatically generate a MATLAB program to reproduce or automate your work after you’ve iterated and gotten the results you want
- Scale analyses to run on clusters, GPUs, and clouds with simple code changes
Cost:
- MATLAB Standard Individual: $2,150
- MATLAB Academic Use, Individual: $500
- Contact for other licensing options and pricing
32. Matplotlib

Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Data scientists use this tool in Python scripts, the Python and IPython shell, the Jupyter Notebook, web application servers, and four graphical user interface toolkits.
- Generate plots, histograms, power spectra, bar charts, error charts, scatterplots, and more with a few lines of code
- Full control of line styles, font properties, axes properties, etc. with an object-oriented interface or via a set of functions similar to MATLAB
- Several Matplotlib add-on toolkits are available
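A plot really does take only a few lines; here is a minimal sketch that draws and saves a line chart headlessly (the data and the file name squares.png are illustrative):

```python
import matplotlib
matplotlib.use("Agg")               # headless backend: render to a file, no GUI
import matplotlib.pyplot as plt

xs = [0, 1, 2, 3]
ys = [x ** 2 for x in xs]

fig, ax = plt.subplots()
ax.plot(xs, ys, marker="o", linestyle="--")   # full control of line style
ax.set_xlabel("x")
ax.set_ylabel("x squared")
fig.savefig("squares.png")          # publication-quality hardcopy output
```

Swapping the backend is all it takes to target an interactive window, a notebook, or a web server instead of a file.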
33. MLBase

UC Berkeley’s AMPLab integrates algorithms, machines, and people to make sense of Big Data. The lab also developed MLBase, an open source project that makes distributed machine learning easier for data scientists.
- Consists of three components: MLlib, MLI, and ML Optimizer
- MLlib is Apache Spark’s distributed ML library
- MLI is an experimental API for feature extraction and algorithm development introducing high-level machine learning programming abstractions
- ML Optimizer automates the task of machine learning pipeline construction and solves a search problem over feature extractors and ML algorithms
- Implement and consume machine learning at scale more easily
34. MySQL

MySQL is one of today’s most popular open source databases, and a popular tool for data scientists who need to access data from a database. Even though MySQL typically serves as the database behind web applications, it can be used in a variety of settings.
- Open source relational database management system
- Store and access your data in a structured way without hassles
- Support data storage needs for production systems
- Use with programming languages such as Java
- Query data after designing the database
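The design-then-query workflow above can be sketched in a few lines. Python's built-in sqlite3 module stands in for a MySQL connection here so the example is self-contained; the SQL is standard, and with MySQL you would connect through a driver such as mysql-connector-python instead:

```python
import sqlite3

# SQL workflow sketch: define a table, insert rows, then query.
# sqlite3 is used as a self-contained stand-in for a MySQL connection;
# the SQL statements themselves are ordinary SQL.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users (name, age) VALUES (?, ?)",
                 [("Ada", 36), ("Grace", 45)])
rows = conn.execute("SELECT name FROM users WHERE age > 40").fetchall()
print(rows)  # [('Grace',)]
```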
35. Narrative Science
Narrative Science helps enterprises maximize the impact of their data with automated, intelligent narratives generated by advanced natural language generation (NLG). Data scientists humanize data with Narrative Science’s technology, which interprets and then transforms data at unparalleled speed and scale.
- Turn data into actionable, powerful assets for making better decisions
- Help others in your organization understand and act on data
- Integrates into existing business intelligence tools
- Create a new reporting experience that drives better decisions more quickly
Cost: Contact for a quote
36. Natural Language Toolkit (NLTK)
A leading platform for building Python programs, Natural Language Toolkit (NLTK) is a tool for working with human language data. NLTK is a helpful tool for inexperienced data scientists and data science students working in computational linguistics using Python.
- Provides easy-to-use interfaces to more than 50 corpora and lexical resources
- Includes a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and more
- Learn more from the active discussion forum
37. NetworkX

NetworkX is a Python package for data scientists. Create, manipulate, and study the structure, dynamics, and functions of complex networks with NetworkX.
- Data structures for graphs, digraphs, and multigraphs
- Abundant standard graph algorithms
- Network structure and analysis measures
- Edges capable of holding arbitrary data
- Generate classic graphs, random graphs, and synthetic networks
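The bullets above fit in a few lines of NetworkX (the graph is illustrative): edges hold arbitrary data, and a standard algorithm runs on them:

```python
import networkx as nx

# Build a small weighted graph and run one of NetworkX's standard
# algorithms: the weighted shortest path.

G = nx.Graph()
G.add_edge("a", "b", weight=1.0)
G.add_edge("b", "c", weight=2.0)
G.add_edge("a", "c", weight=5.0)   # direct edge, but costlier than a-b-c

path = nx.shortest_path(G, "a", "c", weight="weight")
print(path)  # ['a', 'b', 'c']
```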
38. NumPy

A fundamental package for scientific computing with Python, NumPy is well-suited to scientific uses. NumPy also serves as a multi-dimensional container of generic data.
- Contains a powerful N-dimensional array object
- Sophisticated broadcasting functions
- Tools for integrating C/C++ and Fortran code
- Define arbitrary data-types to seamlessly and speedily integrate with a wide variety of databases
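The N-dimensional array object is easiest to see in action (the values below are illustrative). Broadcasting applies the 1-D offsets array across every row of the 2-D matrix without an explicit loop:

```python
import numpy as np

# Broadcasting: the (3,) offsets array is stretched to match the (2, 3)
# matrix, so the addition happens element-wise per row.

matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
offsets = np.array([10.0, 20.0, 30.0])

shifted = matrix + offsets     # offsets broadcast across each row
print(shifted.sum())           # 141.0
```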
39. GNU Octave

GNU Octave is a scientific programming language that is a useful tool for data scientists looking to solve systems of equations or visualize data with high-level plot commands. This tool’s syntax is compatible with MATLAB, and its interpreter can be run in GUI mode, as a console, or invoked as part of a shell script.
- Powerful math-oriented syntax with built-in plotting and visualization tools
- Runs on GNU/Linux, MacOS, BSD, and Windows
- Drop-in compatible with many MATLAB scripts
- Use linear algebra operations on vectors and matrices to solve systems of equations
- Use high-level plot commands in 2D and 3D to visualize data
40. OpenRefine

OpenRefine is a powerful tool for data scientists who want to clean up, transform, and extend data with web services and then link it to databases. Formerly Google Refine, OpenRefine now is an open source project fully supported by volunteers.
- Explore large datasets easily
- Clean and transform data
- Reconcile and match data
- Link and extend datasets with a range of web services
- You may upload cleaned data to a central database
41. pandas

pandas is an open source library that delivers high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Data scientists use this tool when they need a Python data analysis library.
- NumFOCUS-sponsored project that secures development of pandas as a world-class, open source project
- Fast, flexible, and expressive data structures make working with relational and labeled data easy and intuitive
- Powerful and flexible open source data analysis and manipulation tool available in a variety of languages
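A small taste of those labeled data structures (the data is illustrative): build a DataFrame, then group and aggregate, a staple of data preparation:

```python
import pandas as pd

# Group a DataFrame by a column and sum, yielding a Series labeled by city.

df = pd.DataFrame({
    "city": ["Paris", "Paris", "Lyon"],
    "sales": [100, 150, 80],
})
totals = df.groupby("city")["sales"].sum()
print(totals["Paris"])  # 250
```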
42. RapidMiner

Data scientists are more productive when they use RapidMiner, a unified platform for data prep, machine learning, and model deployment. A tool for making data science fast and simple, RapidMiner is a leader in the 2017 Gartner Magic Quadrant for Data Science Platforms, a leader in the 2017 Forrester Wave for predictive analytics and machine learning, and a high performer in the G2 Crowd predictive analytics grid.
- RapidMiner Studio is a visual workflow designer for data scientists
- Share, reuse, and deploy predictive models from RapidMiner Studio with RapidMiner Server
- Run data science workflows directly inside Hadoop with RapidMiner Radoop
- RapidMiner Studio
  - FREE – 10,000 rows of data and 1 logical processor
  - Small: $2,500/year – 100,000 rows of data and 2 logical processors
  - Medium: $5,000/year – 1,000,000 rows of data and 4 logical processors
  - Large: $10,000/year – Unlimited rows of data and unlimited logical processors
- RapidMiner Server
  - FREE – 2 GB RAM, 1 logical processor, and 1,000 Web Service API calls
  - Small: $15,000/year – 16 GB RAM, 4 logical processors, and unlimited Web Service API calls
  - Medium: $30,000/year – 64 GB RAM, 8 logical processors, and unlimited Web Service API calls
  - Large: $60,000/year – Unlimited RAM, unlimited logical processors, and unlimited Web Service API calls
- RapidMiner Radoop
  - FREE – Limited to a single user and community customer support
  - Enterprise: $15,000/year, plus $5,000 for each additional user, with enterprise customer support
43. Redis

Redis is a data structure server that data scientists use as a database, cache, and message broker. This open source, in-memory data structure store supports strings, hashes, lists, and more.
- Built-in replication, Lua scripting, LRU eviction, transactions, and different levels of on-disk persistence
- High availability via Redis Sentinel and automatic partitioning with Redis cluster
- Run atomic operations such as appending to a string, incrementing the value in a hash, pushing an element to a list, and more
44. RStudio

RStudio is a tool for data scientists that is open source and enterprise-ready. This professional software for the R community makes R easier to use.
- Includes a code editor, debugging, and visualization tools
- Integrated development environment (IDE) for R
- Includes a console, syntax-highlighting editor supporting direct code execution and tools for plotting, history, debugging, and workspace management
- Available in open source and commercial editions and runs on the desktop or in a browser connected to RStudio Server or RStudio Server Pro
- Open Source Edition: FREE
- Commercial License: $995/year
45. Scala

The Scala programming language is a tool for data scientists looking to construct elegant class hierarchies to maximize code reuse and extensibility. The tool also empowers users to implement class hierarchies’ behavior using higher-order functions.
- Modern multi-paradigm programming language designed to express common programming patterns concisely and elegantly
- Smoothly integrates features of object-oriented and functional languages
- Supports higher-order functions and allows functions to be nested
- Notion of pattern matching extended to the processing of XML data with the help of right-ignoring sequence patterns using a general extension via extractor objects
46. scikit-learn

scikit-learn is an easy-to-use, general-purpose machine learning library for Python. Data scientists prefer scikit-learn because it features simple, efficient tools for data mining and data analysis.
- Accessible to everyone and reusable in certain contexts
- Built on NumPy, SciPy, and Matplotlib
- Open source, commercially usable BSD license
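The fit/predict workflow scikit-learn is built around, sketched with a 1-nearest-neighbor classifier on toy data (the data and the model choice are illustrative only):

```python
from sklearn.neighbors import KNeighborsClassifier

# Minimal scikit-learn workflow: construct an estimator, fit it, predict.
# 1-NN on cleanly separated points keeps the output deterministic.

X = [[0.0], [1.0], [10.0], [11.0]]   # one feature per sample
y = [0, 0, 1, 1]                     # two classes

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)
print(model.predict([[0.5], [10.5]]))  # [0 1]
```

Every scikit-learn estimator exposes the same fit/predict interface, which is what makes the library easy to swap models in and out of.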
47. SciPy

SciPy, a Python-based ecosystem of open source software, is intended for math, science, and engineering applications. The SciPy Stack includes Python, NumPy, the SciPy library, Matplotlib, and more.
- Scientific computing tools for Python including a collection of open source software and a specified set of core packages
- A community of people who use and develop the SciPy Stack
- SciPy library provides several numerical routines
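As a small example of those numerical routines, scipy.integrate.quad performs adaptive quadrature; here it integrates x squared over [0, 1], whose exact value is 1/3:

```python
from scipy import integrate

# Adaptive numerical integration: quad returns the estimate and an
# error bound.

value, error = integrate.quad(lambda x: x ** 2, 0.0, 1.0)
print(round(value, 6))  # 0.333333
```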
48. Shiny

A web application framework for R by RStudio, Shiny is a tool data scientists use to turn analyses into interactive web applications. Shiny is an ideal tool for data scientists who are inexperienced in web development.
- Easy-to-write apps
- Combines R’s computational power with the modern web’s interactivity
- Use your own servers or RStudio’s hosting service
Cost: Contact for a quote
49. TensorFlow

TensorFlow is a fast, flexible, scalable open source machine learning library for research and production. Data scientists use TensorFlow for numerical computation using data flow graphs.
- Flexible architecture for deploying computation to one or more CPUs or GPUs in a desktop, server, or mobile device with one API
- Nodes in the graph represent mathematical operations, while graph edges represent the multidimensional data arrays communicated between them
- Ideal for conducting machine learning and deep neural networks but applies to a wide variety of other domains
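The dataflow-graph idea in the bullets above can be sketched in plain Python. This is an illustration of the concept only, not the TensorFlow API:

```python
import operator

# Plain-Python sketch of a dataflow graph: nodes represent operations,
# edges carry the values flowing between them, and evaluation walks the
# graph from the outputs back through their inputs.

class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def eval(self):
        # Evaluate upstream nodes first, then apply this node's operation.
        args = [i.eval() if isinstance(i, Node) else i for i in self.inputs]
        return self.op(*args)

x = Node(operator.mul, 3.0, 4.0)   # node: multiply
y = Node(operator.add, x, 5.0)     # node: add, with an edge from x
print(y.eval())  # 17.0
```

TensorFlow builds the same kind of graph over tensors, then schedules its nodes across CPUs and GPUs instead of evaluating them recursively.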
50. TIBCO Spotfire
TIBCO drives digital business by enabling better decisions and faster, smarter actions. Their Spotfire solution is a tool for data scientists that addresses data discovery, data wrangling, predictive analytics, and more.
- Smart, secure, governed, enterprise-class analytics platform with built-in data wrangling
- Delivers AI-driven, visual, geo, and streaming analytics
- Smart visual data discovery with shortened time-to-insight
- Data preparation features empower you to shape, enrich, and transform data and create features and identify signals for dashboards and actions
Cost: FREE trial available
- Spotfire Cloud: $200/month or $2,000/year; Custom pricing also available
- Spotfire Platform: Contact for a quote
- Spotfire Cloud Enterprise: Contact for a quote