

Top Tools for Data Scientists: Analytics Tools, Data Visualization Tools, Database Tools, and More

Data scientists are inquisitive and often seek out new tools that help them find answers. They also need to be proficient with the tools of the trade, even though there are dozens upon dozens of them. Overall, data scientists should have a working knowledge of statistical programming languages for constructing data processing systems, as well as of databases and visualization tools. Many in the field also consider programming an integral part of data science; however, not every data science student studies programming, so it helps to know which tools sidestep coding with a user-friendly graphical interface, where a working knowledge of algorithms is enough to build predictive models.

With everything on a data scientist’s plate, you don’t have time to hunt for the tools of the trade that can help you do your work. That’s why we have rounded up tools that aid in data visualization, algorithms, statistical programming languages, and databases. We chose tools based on their ease of use, popularity, reputation, and features, and we have listed them in alphabetical order to simplify your search; they are not ranked or rated.

1. Algorithms.io
@algorithms_io

Algorithms.io

Algorithms.io is a LumenData Company providing machine learning as a service for streaming data from connected devices. This tool turns raw data into real-time insights and actionable events so that companies are in a better position to deploy machine learning for streaming data.

Key Features:

  • Simplifies the process of making machine learning accessible to companies and developers working with connected devices
  • Cloud platform addresses the common challenges with infrastructure, scale, and security that arise when deploying machine learning on streaming data
  • Creates a set of APIs for developers to use to integrate machine learning into web and mobile apps so that any application can turn raw streaming data into intelligent output

Cost: Contact for a quote

2. Apache Giraph

Apache Giraph

An iterative graph processing system designed for high scalability, Apache Giraph began as an open source counterpart to Pregel but adds multiple features beyond the basic Pregel model. Giraph is used by data scientists to “unleash the potential of structured datasets at a massive scale.”

Key Features:

  • Inspired by the Bulk Synchronous Parallel model of distributed computation as introduced by Leslie Valiant
  • Master computation
  • Sharded aggregators
  • Edge-oriented input
  • Out-of-core computation
  • Steady development cycle and growing community of users

Cost: FREE

3. Apache Hadoop
@hadoop

Apache Hadoop

Apache Hadoop is an open source software for reliable, distributed, scalable computing. A framework allowing for the distributed processing of large datasets across clusters of computers, the software library uses simple programming models. Hadoop is appropriate for research and production.

Key Features:

  • Designed to scale from single servers to thousands of machines
  • The library detects and handles failures at the application layer instead of relying on hardware to deliver high availability
  • Includes the Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce modules

Cost: FREE

4. Apache HBase
@ApacheHBase

Apache HBase

The Hadoop database, Apache HBase is a distributed, scalable, big data store. Data scientists use this open source tool when they need random, real-time read/write access to Big Data. Apache HBase also provides capabilities similar to Bigtable on top of Hadoop and HDFS.

Key Features:

  • Open source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data
  • Linear and modular scalability
  • Strictly consistent reads and writes
  • Automatic and configurable sharding of tables

Cost: FREE

5. Apache Hive
@ApacheHive

Apache Hive

An Apache Software Foundation project, Apache Hive began as a subproject of Apache Hadoop and now is a top-level project itself. This tool is data warehouse software that assists in reading, writing, and managing large datasets that reside in distributed storage using SQL.

Key Features:

  • Projects structure onto data already in storage
  • A command line tool and a JDBC driver are provided to connect users to Hive

Cost: FREE

6. Apache Kafka
@apachekafka

Apache Kafka

A distributed streaming platform, Apache Kafka efficiently processes streams of data in real time. Data scientists use this tool to build real-time data pipelines and streaming apps because it empowers you to publish and subscribe to streams of records, store streams of records in a fault-tolerant way, and process streams of records as they occur.

Key Features:

  • Runs as a cluster on one or more servers
  • Cluster stores streams of records in categories called topics
  • Each record includes a key, value, and timestamp
  • Has four core APIs: Producer API, Consumer API, Streams API, and Connector API

Cost: FREE
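
To make the publish/subscribe model concrete, here is a minimal sketch in Python using the third-party kafka-python client (not part of Kafka itself). It assumes a broker reachable at localhost:9092 and uses a made-up topic name purely for illustration.

```python
# Publish and consume a few records with kafka-python (pip install kafka-python).
# Assumes a Kafka broker at localhost:9092 with auto topic creation enabled.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    # Each record carries a key and a value; Kafka adds the timestamp.
    producer.send("sensor-readings", key=str(i).encode(), value=f"reading-{i}".encode())
producer.flush()

consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop polling after 5 seconds of silence
)
for record in consumer:
    print(record.key, record.value, record.timestamp)
```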

7. Apache Mahout
@ApacheMahout

Apache Mahout

An open source Apache Foundation project for machine learning, Apache Mahout aims to enable scalable machine learning and data mining. Specifically, the project’s goal is to “build an environment for quickly creating scalable performant machine learning applications.”

Key Features:

  • Simple, extensible programming environment and framework for building scalable algorithms
  • Includes a wide variety of pre-made algorithms for Scala + Apache Spark, H2O, and Apache Flink
  • Provides Samsara, a vector math experimentation environment with R-like syntax, which works at scale

Cost: FREE

8. Apache Mesos
@ApacheMesos

Apache Mesos

A cluster manager, Apache Mesos provides efficient resource isolation and sharing across distributed applications or frameworks. Mesos abstracts CPU, memory, storage, and other resources away from physical or virtual machines to enable fault-tolerant, elastic distributed systems to be built easily and run effectively.

Key Features:

  • Built using principles similar to that of the Linux kernel but at a different level of abstraction
  • Runs on every machine and provides applications like Hadoop and Spark with APIs for resource management and scheduling completely across datacenter and cloud environments
  • Easily scales to 10,000s of nodes
  • Non-disruptive upgrades for high availability
  • Cross platform and cloud provider agnostic

Cost: FREE

9. Apache Pig

Apache Pig

A platform designed for analyzing large datasets, Apache Pig consists of a high-level language for expressing data analysis programs that is coupled with infrastructure for evaluating such programs. Because Pig programs’ structures can handle significant parallelization, they can tackle large datasets.

Key Features:

  • Infrastructure consists of a compiler capable of producing sequences of Map-Reduce programs for which large-scale parallel implementations already exist
  • Language layer includes a textual language called Pig Latin
  • Key properties of Pig Latin include ease of programming, optimization opportunities, and extensibility

Cost: FREE

10. Apache Spark
@ApacheSpark

Apache Spark

Apache Spark delivers “lightning-fast cluster computing.” A wide range of organizations use Spark to process large datasets, and this data scientist tool can access diverse data sources such as HDFS, Cassandra, HBase, and S3.

Key Features:

  • Advanced DAG execution engine to support acyclic data flow and in-memory computing
  • More than 80 high-level operators make it simple to build parallel apps
  • Use interactively from the Scala, Python, and R shells
  • Powers a stack of libraries including SQL, DataFrames, MLlib, GraphX, and Spark Streaming

Cost: FREE
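
As a rough illustration of the kind of work Spark handles, here is a minimal PySpark sketch. The CSV path and column names are placeholders, and it assumes the pyspark package (and a Java runtime) is installed locally.

```python
# Load a CSV into a Spark DataFrame and compute a small grouped summary.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("quick-look").getOrCreate()

# "events.csv", "country", and "duration" are illustrative names.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
summary = (
    df.groupBy("country")
      .agg(F.count("*").alias("events"), F.avg("duration").alias("avg_duration"))
      .orderBy(F.desc("events"))
)
summary.show(10)
spark.stop()
```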

11. Apache Storm
@ApacheStorm
@stormprocessor

Apache Storm

Apache Storm is a tool for data scientists that handles distributed and fault-tolerant real-time computation. It also tackles stream processing, continuous computation, distributed RPC, and more.

Key Features:

  • Free and open source
  • Reliably process unbounded data streams for real-time processing
  • Use with any programming language
  • Use cases include real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more
  • More than one million tuples processed per second per node
  • Integrates with your existing queueing and database technologies

Cost: FREE

12. BigML
@bigmlcom

BigML

BigML makes machine learning simple. This company-wide platform runs in the cloud or on premises for operationalizing machine learning in organizations. BigML makes it simple to solve and automate classification, regression, cluster analysis, anomaly detection, association discovery, and topic modeling tasks.

Key Features:

  • Build sophisticated machine learning-based solutions affordably
  • Distill predictive patterns from data into practical, intelligent applications that anyone can use
  • The platform, private deployments, and rich toolset help users create, rapidly experiment, fully automate, and manage machine learning workflows to power intelligent applications

Cost: Contact for a quote

13. Bokeh
@BokehPlots

Bokeh

A Python interactive visualization library, Bokeh targets modern web browsers for presentation and helps users create interactive plots, dashboards, and data apps easily.

Key Features:

  • Provides elegant and concise construction of graphics similar to D3.js
  • Extends capabilities to high-performance interactivity over large or streaming datasets
  • Quickly and easily create interactive plots, dashboards, and data applications

Cost: FREE
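
Here is a minimal sketch of what building an interactive Bokeh plot looks like; the data is synthetic and the output file name is arbitrary.

```python
# Write an interactive line-and-scatter plot to a standalone HTML file.
from bokeh.plotting import figure, output_file, show

x = list(range(50))
y = [v ** 0.5 for v in x]

output_file("sqrt.html")  # standalone HTML output
p = figure(title="Square root", x_axis_label="x", y_axis_label="sqrt(x)")
p.line(x, y, line_width=2)
p.scatter(x, y, size=5)
show(p)  # opens the plot in a browser with pan/zoom tools
```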

14. Cascading
@cascading

Cascading

Cascading is an application development platform for data scientists building Big Data applications on Apache Hadoop. Users can solve simple and complex data problems with Cascading because it boasts computation engine, systems integration framework, data processing, and scheduling capabilities.

Key Features:

  • Balances an ideal level of abstraction with appropriate degrees of freedom
  • Offers Hadoop development teams portability
  • Change a few lines of code and port Cascading to another supported compute fabric
  • Runs on and may be ported between MapReduce, Apache Tez, and Apache Flink

Cost: FREE

15. Clojure

Clojure

A robust and fast programming language, Clojure is a practical tool that marries the interactive development of a scripting language with an efficient infrastructure for multithreaded programming. Clojure is unique in that it is a compiled language yet remains dynamic, with every feature supported at runtime.

Key Features:

  • Rich set of immutable, persistent data structures
  • Offers a software transactional memory system and reactive Agent system to ensure clean, correct, multithreaded designs when mutable state is necessary
  • Provides easy access to Java frameworks with optional type hints and type inference
  • Dynamic environment that users can interact with

Cost: FREE

16. D3.js
@mbostock

D3.js

Committed to “code and data for humans,” Mike Bostock created D3.js. Data scientists use this tool, a JavaScript library for manipulating documents based on data, to add life to their data with SVG, Canvas, and HTML.

Key Features:

  • Emphasis on web standards to gain full capabilities of modern browsers without being tied to a proprietary framework
  • Combines powerful visualization components and a data-driven approach to Document Object Model (DOM) manipulation
  • Bind arbitrary data to a DOM and then apply data-driven transformations to the document

Cost: FREE

17. DataRobot
@DataRobot

DataRobot

An advanced machine learning automation platform, DataRobot helps data scientists build better predictive models faster. You can keep up with the ever-expanding ecosystem of machine learning algorithms easily when you use DataRobot.

Key Features:

  • Constantly expanding, vast set of diverse, best-in-class algorithms from leading sources
  • Train, test, and compare hundreds of varying models with one line of code or a single click
  • Automatically identifies top pre-processing and feature engineering for each modeling technique
  • Uses hundreds and even thousands of servers as well as multiple cores within each server to parallelize data exploration, model building, and hyper-parameter tuning
  • Easy model deployment

Cost: Contact for a quote

18. DataRPM
@DataRPM

DataRPM

DataRPM is the “industry’s first and only cognitive predictive maintenance platform for industrial IoT.” DataRPM also is the recipient of the 2017 Technology Leadership Award for Cognitive Predictive Maintenance in Automotive Manufacturing from Frost & Sullivan.

Key Features:

  • Uses patent-pending meta-learning technology, an integral component of Artificial Intelligence, to automate predictions of asset failures
  • Runs multiple live automated machine learning experiments on datasets
  • Extracts data from every experiment, trains models on the metadata repository, applies models to predict the best algorithms, and builds machine-generated, human-verified machine learning models for predictive maintenance
  • Workflow uses recipes such as feature engineering, segmentation, influencing factors, and prediction recipes to deliver prescriptive recommendations

Cost: Contact for a quote

19. Excel
@Office

Excel

Many data scientists view Excel as a secret weapon. It is a familiar tool that data scientists can rely on to quickly sort, filter, and work with their data. It’s also on nearly every computer you come across, so data scientists can work from just about anywhere with Excel.

Key Features:

  • Named ranges for creating a makeshift database
  • Sorting and filtering with one click to quickly and easily explore your dataset
  • Use Advanced Filtering to filter your dataset based on criteria you specify in a different range
  • Use pivot tables to cross-tabulate data and calculate counts, sums, and other metrics
  • Visual Basic provides a variety of creative solutions

Cost: FREE trial available

  • Home Buying Options
    • Office 365 Home: $99.99/year
    • Office 365 Personal: $69.99/year
    • Office Home & Student 2016 for PC: $149.99 one-time purchase
  • Business Buying Options
    • Office 365 Business: $8.25/user/month with annual commitment
    • Office 365 Business Premium: $12.50/user/month with annual commitment
    • Office 365 Business Essentials: $5/user/month with annual commitment

20. Feature Labs

Feature Labs

An end-to-end data science solution, Feature Labs develops and deploys intelligent products and services for your data. They also work with data scientists to help you develop and deploy intelligent products, features, and services.

Key Features:

  • Integrates with your data to help scientists, developers, analysts, managers, and executives
  • Discover new insights and gain a better understanding of how your data forecasts the future of your business
  • On-boarding sessions tailored to your data and use cases to help you get off to an efficient start

Cost: Contact for a quote

21. ForecastThis
@forecastthis

ForecastThis

ForecastThis is a tool for data scientists that automates predictive model selection. The company strives to make deep learning relevant for finance and economics by enabling investment managers, quantitative analysts, and data scientists to use their own data to generate robust forecasts and optimize complex future objectives.

Key Features:

  • Simple API and spreadsheet plugins
  • Uniquely robust global optimization algorithms
  • Scales to challenges of nearly any shape or size
  • Algorithms create plausible, interpretable models of market processes to lend credibility to any output and help you get inside the market more successfully

Cost: Contact for a quote

22. Fusion Tables
@GoogleFT

Fusion Tables

Google Fusion Tables is a cloud-based data management service that focuses on collaboration, ease-of-use, and visualizations. An experimental web application, Fusion Tables empowers data scientists to gather, visualize, and share data tables.

Key Features:

  • Visualize bigger table data online
  • Combine with other data on the web
  • Make a map in minutes
  • Search thousands of public Fusion Tables or millions of public tables from the web that you can import to Fusion Tables
  • Import your own data and visualize it instantly
  • Publish your visualization on other web properties

Cost: FREE

23. Gawk

Gawk

GNU is an operating system that enables you to use a computer without software “that would trample your freedom.” The GNU Project created Gawk, an awk utility that interprets a special-purpose programming language. Gawk empowers users to handle simple data-reformatting jobs using only a few lines of code.

Key Features:

  • Search files for lines or other text units containing one or more patterns
  • Data-driven rather than procedural
  • Makes it easy to read and write programs

Cost: FREE

24. ggplot2
@hadleywickham
@winston_chang

ggplot2

Hadley Wickham and Winston Chang developed ggplot2, a plotting system for R that is based on the grammar of graphics. With ggplot2, data scientists can avoid many of the hassles of plotting while maintaining the attractive parts of base and lattice graphics and producing complex multi-layered graphics easily.

Key Features:

  • Create new types of graphic tailored to your needs
  • Create graphics to help you understand your data
  • Produce elegant graphics for data analysis

Cost: FREE

25. GraphLab Create

GraphLab Create

Data scientists and developers use GraphLab Create to build state-of-the-art data products via machine learning. This machine learning modeling tool helps users build intelligent applications end-to-end in Python.

Key Features:

  • Simplifies development of machine learning models
  • Incorporates automatic feature engineering, model selection, and machine learning visualizations specific to the application
  • Identify and link records within or across data sources corresponding to the same real-world entities

Cost: 

  • FREE one-year renewable subscription for academic use

26. IPython
@IPythonDev

IPython

IPython, short for Interactive Python, is a growing project with expanding language-agnostic components that provides a rich architecture for interactive computing. An open source tool for data scientists, IPython supports Python 2.7 and 3.3 or newer.

Key Features:

  • A powerful interactive shell
  • A kernel for Jupyter
  • Support for interactive data visualization and use of GUI toolkits
  • Load flexible, embeddable interpreters into your own projects
  • Easy-to-use high performance parallel computing tools

Cost: FREE
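
As one small example of the interactive shell, the sketch below drops into an IPython prompt from the middle of an ordinary script via IPython.embed(); the variables are made up for illustration.

```python
# Pause a script and inspect its state interactively (pip install ipython).
from IPython import embed

results = {"trials": 128, "successes": 97}
rate = results["successes"] / results["trials"]

# Opens an interactive IPython prompt with `results` and `rate` in scope;
# exiting the shell lets the script continue.
embed()

print(f"success rate: {rate:.2%}")
```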

27. Java
@SW_Java

Java

Java is a language with a broad user base that serves as a tool for data scientists creating products and frameworks involving distributed systems, data analysis, and machine learning. Java now is recognized as being just as important to data science as R and Python because it is robust, convenient, and scalable for data science applications.

Key Features:

  • Easy to break down and understand
  • Helps users be explicit about types of variables and data
  • Well-developed suite of tools
  • Develop and deploy applications on desktops and servers in addition to embedded environments
  • Rich user interface, performance, versatility, portability, and security for modern applications

Cost: FREE trial available; Contact for commercial license cost

28. Jupyter
@ProjectJupyter

Jupyter

Jupyter provides multi-language interactive computing environments. Its Notebook, an open source web application, allows data scientists to create and share documents containing live code, equations, visualizations, and explanatory text.

Key Features:

  • Uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and more
  • Supports more than 40 programming languages including popular data science languages like Python, R, Julia, and Scala
  • Share notebooks with others via email, Dropbox, GitHub, and the Jupyter Notebook Viewer
  • Code can produce images, videos, LaTeX, and JavaScript
  • Use interactive widgets to manipulate and visualize data in realtime

Cost: FREE

29. KNIME Analytics Platform
@knime

KNIME Analytics Platform

Thanks to its open platform, KNIME is a tool for navigating complex data freely. The KNIME Analytics Platform is a leading open solution for data-driven innovation to help data scientists uncover data’s hidden potential, mine for insights, and predict futures.

Key Features:

  • Enterprise-grade, open source platform
  • Deploy quickly and scale easily
  • More than 1,000 modules
  • Hundreds of ready-to-run examples
  • Comprehensive range of integrated tools
  • The widest choice of advanced algorithms available

Cost: FREE

30. Logical Glue
@logicalglue

Logical Glue

An award-winning white-box machine learning and artificial intelligence platform, Logical Glue increases productivity and profit for organizations. Data scientists choose this tool because it brings your insights to life for your audience.

Key Features:

  • Visual narratives that bring insights to life
  • Improve the communication and visualization of your insights more easily
  • Access new techniques with Fuzzy Logic and Artificial Neural Networks
  • Build the most accurate predictive models
  • Know exactly which data is predictive
  • Simple deployment and integration

Cost: Contact for a quote

31. MATLAB
@MATLAB

MATLAB

A high-level language and interactive environment for numerical computation, visualization, and programming, MATLAB is a powerful tool for data scientists. MATLAB serves as the language of technical computing and is useful for math, graphics, and programming.

Key Features:

  • Analyze data, develop algorithms, and create models
  • Designed to be intuitive
  • Combines a desktop environment for iterative analysis and design processes with a programming language capable of expressing matrix and array mathematics directly
  • Interactive apps to see how different algorithms work with your data
  • Automatically generate a MATLAB program to reproduce or automate your work after you’ve iterated and gotten the results you want
  • Scale analyses to run on clusters, GPUs, and clouds with simple code changes

Cost:

  • MATLAB Standard Individual: $2,150
  • MATLAB Academic Use, Individual: $500
  • Contact for other licensing options and pricing

32. Matplotlib
@matplotlib

Matplotlib

Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Data scientists use this tool in Python scripts, the Python and IPython shell, the Jupyter Notebook, web application servers, and four graphical user interface toolkits.

Key Features:

  • Generate plots, histograms, power spectra, bar charts, error charts, scatterplots, and more with a few lines of code
  • Full control of line styles, font properties, axes properties, etc. with an object-oriented interface or via a set of functions similar to MATLAB
  • Several Matplotlib add-on toolkits are available

Cost: FREE
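
A minimal sketch of the “few lines of code” workflow the feature list describes; the data here is randomly generated purely for illustration.

```python
# Produce a scatterplot and a histogram side by side and save them to a file.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.3, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, y, s=5)
ax1.set_title("Scatter")
ax2.hist(x, bins=30)
ax2.set_title("Histogram")
fig.tight_layout()
fig.savefig("figures.png", dpi=150)  # or plt.show() for an interactive window
```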

33. MLBase
@amplab

MLBase

UC Berkeley’s AMPLab integrates algorithms, machines, and people to make sense of Big Data. They also developed MLBase, an open source project that makes distributed machine learning easier for data scientists.

Key Features:

  • Consists of three components: MLib, MLI, and ML Optimizer
  • MLib is Apache Spark’s distributed ML library
  • MLI is an experimental API for feature extraction and algorithm development introducing high-level machine learning programming abstractions
  • ML Optimizer automates the task of machine learning pipeline construction and solves a search problem over feature extractors and ML algorithms
  • Implement and consume machine learning at scale more easily

Cost: FREE

34. MySQL
@MySQL

MySQL

MySQL is one of today’s most popular open source databases. It’s also a popular tool for data scientists to use to access data from the database. Even though MySQL typically is used in web applications, it works well in a variety of settings.

Key Features:

  • Open source relational database management system
  • Store and access your data in a structured way without hassles
  • Support data storage needs for production systems
  • Use with programming languages such as Java
  • Query data after designing the database

Cost: FREE
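
A minimal sketch of querying MySQL from Python with the MySQL Connector/Python driver; the host, credentials, table, and columns are placeholders for your own schema.

```python
# Run a grouped query against a MySQL database
# (pip install mysql-connector-python).
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="analyst", password="secret", database="sales"
)
cur = conn.cursor()
cur.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY 2 DESC"
)
for region, total in cur.fetchall():
    print(region, total)
cur.close()
conn.close()
```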

35. Narrative Science
@narrativesci

Narrative Science

Narrative Science helps enterprises maximize the impact of their data with automated, intelligent narratives generated by advanced natural language generation (NLG). Data scientists humanize data with Narrative Science’s technology that interprets and then transforms data at unparalleled speed and scale.

Key Features:

  • Turn data into actionable, powerful assets for making better decisions
  • Help others in your organization understand and act on data
  • Integrates into existing business intelligence tools
  • Create a new reporting experience that drives better decisions more quickly

Cost: Contact for a quote

36. Natural Language Toolkit (NLTK)
@NLTK_org

Natural Language Toolkit (NLTK)

A leading platform for building Python programs, Natural Language Toolkit (NLTK) is a tool for working with human language data. NLTK is a helpful tool for inexperienced data scientists and data science students working in computational linguistics using Python.

Key Features:

  • Provides easy-to-use interfaces to more than 50 corpora and lexical resources
  • Includes a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and more
  • Learn more from the active discussion forum

Cost: FREE
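
A minimal sketch of NLTK’s tokenizing and tagging interfaces; note that the names of the downloadable resources can vary slightly between NLTK releases.

```python
# Tokenize and part-of-speech tag a sentence with NLTK.
import nltk

# Fetch tokenizer and tagger models on first run; resource names may
# differ slightly across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "Data scientists use NLTK to work with human language data."
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens))   # prints (token, part-of-speech tag) pairs
```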

37. NetworkX

NetworkX

NetworkX is a Python package tool for data scientists. Create, manipulate, and study the structure, dynamics, and functions of complex networks with NetworkX.

Key Features:

  • Data structures for graphs, digraphs, and multigraphs
  • Abundant standard graph algorithms
  • Network structure and analysis measures
  • Edges capable of holding arbitrary data
  • Generate classic graphs, random graphs, and synthetic networks

Cost: FREE
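
A minimal sketch of building a small graph and computing a few of the standard measures NetworkX provides; the nodes and edges are illustrative.

```python
# Build a small undirected graph and query its structure.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("B", "D")])

print(nx.shortest_path(G, "A", "C"))   # e.g. ['A', 'B', 'C']
print(nx.degree_centrality(G))         # centrality score per node
print(nx.density(G))                   # how close the graph is to complete
```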

38. NumPy

NumPy

A fundamental package for scientific computing with Python, NumPy is well-suited to scientific uses. NumPy also serves as a multi-dimensional container of generic data.

Key Features:

  • Contains a powerful N-dimensional array object
  • Sophisticated broadcasting functions
  • Tools for integrating C/C++ and Fortran code
  • Define arbitrary data-types to seamlessly and speedily integrate with a wide variety of databases

Cost: FREE
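
A minimal sketch of NumPy’s N-dimensional arrays and broadcasting: standardize each column of a matrix without an explicit Python loop. The matrix itself is random, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = rng.normal(loc=10.0, scale=3.0, size=(1000, 4))  # 1000 rows, 4 features

# Broadcasting: the length-4 mean/std vectors apply across all 1000 rows.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

print(standardized.mean(axis=0).round(6))  # ~0 per column
print(standardized.std(axis=0).round(6))   # ~1 per column
```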

39. Octave
@GnuOctave

Octave

GNU Octave is a scientific programming language that is a useful tool for data scientists looking to solve systems of equations or visualize data with high-level plot commands. This tool’s syntax is compatible with MATLAB, and its interpreter can be run in GUI mode, as a console, or invoked as part of a shell script.

Key Features:

  • Powerful math-oriented syntax with built-in plotting and visualization tools
  • Runs on GNU/Linux, MacOS, BSD, and Windows
  • Drop-in compatible with many MATLAB scripts
  • Use linear algebra operations on vectors and matrices to solve systems of equations
  • Use high-level plot commands in 2D and 3D to visualize data

Cost: FREE

40. OpenRefine
@OpenRefine

OpenRefine

OpenRefine is a powerful tool for data scientists who want to clean up, transform, and extend data with web services and then link it to databases. Formerly Google Refine, OpenRefine now is an open source project fully supported by volunteers.

Key Features:

  • Explore large datasets easily
  • Clean and transform data
  • Reconcile and match data
  • Link and extend datasets with a range of web services
  • You may upload cleaned data to a central database

Cost: FREE

41. pandas

pandas

pandas is an open source library that delivers high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Data scientists use this tool when they need a Python data analysis library.

Key Features:

  • NumFOCUS-sponsored project that secures development of pandas as a world-class, open source project
  • Fast, flexible, and expressive data structures make working with relational and labeled data easy and intuitive
  • Powerful and flexible open source data analysis and manipulation tool available in a variety of languages

Cost: FREE
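
A minimal sketch of pandas’ labeled data structures: build a small DataFrame, filter it, and aggregate by group. The data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["north", "south", "north", "west", "south"],
        "units": [12, 7, 3, 9, 14],
        "price": [2.5, 3.0, 2.5, 4.0, 3.0],
    }
)
df["revenue"] = df["units"] * df["price"]

# Keep rows with more than 5 units, then sum revenue per region.
summary = (
    df[df["units"] > 5]
    .groupby("region", as_index=False)["revenue"]
    .sum()
    .sort_values("revenue", ascending=False)
)
print(summary)
```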

42. RapidMiner
@RapidMiner

RapidMiner

Data scientists are more productive when they use RapidMiner, a unified platform for data prep, machine learning, and model deployment. A tool for making data science fast and simple, RapidMiner is a leader in the 2017 Gartner Magic Quadrant for Data Science Platforms, a leader in 2017 Forrester Wave for predictive analytics and machine learning, and a high performer in the G2 Crowd predictive analytics grid.

Key Features:

  • RapidMiner Studio is a visual workflow designer for data scientists
  • Share, reuse, and deploy predictive models from RapidMiner Studio with RapidMiner Server
  • Run data science workflows directly inside Hadoop with RapidMiner Radoop

Cost:

  • RapidMiner Studio
    • FREE – 10,000 rows of data and 1 logical processor
    • Small: $2,500/year – 100,000 rows of data and 2 logical processors
    • Medium: $5,000/year – 1,000,000 rows of data and 4 logical processors
    • Large: $10,000/year – Unlimited rows of data and unlimited logical processors
  • RapidMiner Server
    • FREE – 2 GB RAM, 1 logical processor, and 1,000 Web Service API calls
    • Small: $15,000/year – 16 GB RAM, 4 logical processors, and unlimited Web Service API calls
    • Medium: $30,000/year – 64 GB RAM, 8 logical processors, and unlimited Web Service API calls
    • Large: $60,000/year – Unlimited GB RAM, unlimited logical processors, and unlimited Web Service API calls
  • RapidMiner Radoop
    • FREE – Limited to a single user and community customer support
    • Enterprise: $15,000/year, plus $5,000 for each additional user, with enterprise customer support

43. Redis
@redisfeed

Redis

Redis is a data structure server that data scientists use as a database, cache, and message broker. This open source, in-memory data structure store supports strings, hashes, lists, and more.

Key Features:

  • Built-in replication, Lua scripting, LRU eviction, transactions, and different levels of on-disk persistence
  • High availability via Redis Sentinel and automatic partitioning with Redis Cluster
  • Run atomic operations such as appending to a string, incrementing the value in a hash, pushing an element to a list, and more

Cost: FREE
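
A minimal sketch using the redis-py client; it assumes a Redis server on localhost:6379, and the key names are illustrative. It exercises a few of the data structures and atomic operations mentioned above.

```python
# Basic string, hash, and list operations with redis-py (pip install redis).
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("greeting", "hello")           # string
r.append("greeting", ", world")      # atomic append to a string
r.hincrby("page:home", "visits", 1)  # atomic increment inside a hash
r.lpush("recent_users", "alice")     # push an element onto a list

print(r.get("greeting"), r.hget("page:home", "visits"), r.lrange("recent_users", 0, -1))
```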

44. RStudio
@rstudio

RStudio

RStudio is a tool for data scientists that is open source and enterprise-ready. This professional software for the R community makes R easier to use.

Key Features:

  • Includes a code editor, debugging, and visualization tools
  • Integrated development environment (IDE) for R
  • Includes a console, syntax-highlighting editor supporting direct code execution and tools for plotting, history, debugging, and workspace management
  • Available in open source and commercial editions and runs on the desktop or in a browser connected to RStudio Server or RStudio Server Pro

Cost:

  • Open Source Edition: FREE
  • Commercial License: $995/year

45. Scala
@scala_lang

Scala

The Scala programming language is a tool for data scientists looking to construct elegant class hierarchies to maximize code reuse and extensibility. The tool also empowers users to implement class hierarchies’ behavior using higher-order functions.

Key Features:

  • Modern multi-paradigm programming language designed to express common programming patterns concisely and elegantly
  • Smoothly integrates features of object-oriented and functional languages
  • Supports higher-order functions and allows functions to be nested
  • Notion of pattern matching extended to the processing of XML data with the help of right-ignoring sequence patterns using a general extension via extractor objects

Cost: FREE

46. scikit-learn
@scikit_learn

scikit-learn

scikit-learn is an easy-to-use, general-purpose machine learning library for Python. Data scientists prefer scikit-learn because it features simple, efficient tools for data mining and data analysis.

Key Features:

  • Accessible to everyone and reusable in certain contexts
  • Built on NumPy, SciPy, and Matplotlib
  • Open source, commercially usable BSD license

Cost: FREE
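
A minimal sketch of a typical scikit-learn workflow on one of its bundled datasets: split the data, fit a model, and evaluate it on held-out samples. The model choice and parameters are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```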

47. SciPy

SciPy

SciPy, a Python-based ecosystem of open source software, is intended for math, science, and engineering applications. The SciPy Stack includes Python, NumPy, Matplotlib, the SciPy library, and more.

Key Features:

  • Scientific computing tools for Python including a collection of open source software and a specified set of core packages
  • A community of people who use and develop the SciPy Stack
  • SciPy library provides several numerical routines

Cost: FREE
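
A minimal sketch of the numerical routines in the SciPy library: minimize a simple function and integrate another one numerically. The functions are toy examples.

```python
import numpy as np
from scipy import integrate, optimize

# Find the minimum of f(x) = (x - 3)^2 + 1 (minimum at x = 3).
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)
print(result.x)

# Numerically integrate sin(x) from 0 to pi (exact answer: 2).
value, error = integrate.quad(np.sin, 0, np.pi)
print(value)
```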

48. Shiny

Shiny

A web application framework for R by RStudio, Shiny is a tool data scientists use to turn analyses into interactive web applications. Shiny is an ideal tool for data scientists who are inexperienced in web development.

Key Features:

  • No HTML, CSS, or JavaScript knowledge required
  • Easy-to-write apps
  • Combines R’s computational power with the modern web’s interactivity
  • Use your own servers or RStudio’s hosting service

Cost: Contact for a quote

49. TensorFlow
@tensorflow

TensorFlow

TensorFlow is a fast, flexible, scalable open source machine learning library for research and production. Data scientists use TensorFlow for numerical computation using data flow graphs.

Key Features:

  • Flexible architecture for deploying computation to one or more CPUs or GPUs in a desktop, server, or mobile device with one API
  • Nodes in the graph represent mathematical operations, while graph edges represent the multidimensional data arrays communicated between them
  • Ideal for conducting machine learning and deep neural networks but applies to a wide variety of other domains

Cost: FREE
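
A minimal sketch of defining and training a small network with TensorFlow’s Keras API on synthetic data; the layer sizes, epochs, and target are arbitrary choices for illustration.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 4)).astype("float32")
y = (X.sum(axis=1) > 0).astype("float32")   # simple synthetic binary target

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))      # [loss, accuracy]
```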

50. TIBCO Spotfire
@TIBCO

TIBCO Spotfire

TIBCO drives digital business by enabling better decisions and faster, smarter actions. Their Spotfire solution is a tool for data scientists that addresses data discovery, data wrangling, predictive analytics, and more.

Key Features:

  • Smart, secure, governed, enterprise-class analytics platform with built-in data wrangling
  • Delivers AI-driven, visual, geo, and streaming analytics
  • Smart visual data discovery with shortened time-to-insight
  • Data preparation features empower you to shape, enrich, and transform data and create features and identify signals for dashboards and actions

Cost: FREE trial available

  • Spotfire Cloud: $200/month or $2,000/year; Custom pricing also available
  • Spotfire Platform: Contact for a quote
  • Spotfire Cloud Enterprise: Contact for a quote

51. BONUS: PyXLL.com
@pyxll

This blog features a comprehensive list of tools for working with Python and Excel. It covers writing Excel Add-Ins in Python, reading and writing Excel files, and interacting with Excel. It’s a great single resource for understanding the differences between the many Python/Excel tools out there.
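
As a taste of the “reading and writing Excel files” category the blog covers, here is a minimal sketch with openpyxl, one widely used library of that kind; it is our illustrative choice, not necessarily one the blog endorses, and the file and values are made up.

```python
# Write a small worksheet to an .xlsx file and read it back
# (pip install openpyxl).
from openpyxl import Workbook, load_workbook

wb = Workbook()
ws = wb.active
ws.append(["region", "units"])   # header row
ws.append(["north", 12])
ws.append(["south", 7])
wb.save("sales.xlsx")

wb2 = load_workbook("sales.xlsx")
for row in wb2.active.iter_rows(min_row=2, values_only=True):
    print(row)                   # ('north', 12) then ('south', 7)
```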
