First things first, I need to be forthright an acknowledge the fact that I haven’t used, tried to use, or even know of every data tool out there. There are lots out there, each one with its unique functionality, helping facilitate different sets of steps in the data process, and there are constantly new tools coming to the market. So, yes, it is true, I have no familiarity with a lot of the popular data tools out there such as R, Alteryx, Power BI, just to name a few.
Alternatively, for this post, I will be directly comparing Python to the data tools I have familiarity with: SQL, Excel, and Tableau. These programs are relatively ubiquitous in data world in their own right –most people that have ever interacted with a computer have most likely opened Excel at some point and seems to be the gateway tool for new and aspiring data folk like myself. Furthermore, Microsoft and Tableau both issue proprietary certificates for various levels of mastery — posts of these hard-earned certificates make up the majority of my LinkedIn feed. SQL is open source, but there are plenty of institutions and individuals who make a living peddling courses and certificates for this tool as well.
Now I need to disclose that although I feel comfortable to write about these data tools, I am by no means an expert, so my assessments may have blind spots.
Excel has a fairly intuitive graphical user interface in which all of your data can be visible at all times. This makes it is extremely user friendly, contributing to its status as a beginner data tool. Excel has its own format of spreadsheet file but is also compatible with CSV files. It has a pretty vast range of functionality, spanning from simply summing a list of numbers, to the creation of PivotTables, which dynamically tabulate column aggregates.
Excel capabilities offer a really hands-on approach to the data cleaning process, allowing users to easily click on individual cells to change values, which sometimes can be truly convenient. Additionally, there are many functions that can be called on to help with data cleaning such as TRIM, CONCAT, COUNTBLANK, and various iterations of IF.
Among other facets of the data process, Excel serves as an introduction to data-visualization — once data is prepped and selected, graphs and charts (including the dreaded pie-chart) are just a click away, with detail options like axes labels, title, trendlines easily accessible. Excel also has a variety of add-ins, including Data ToolPak, which within itself contains several math and statistics-oriented formulas, which can be useful for the modelling phase.
Excel also offers the chance to automate through its macros, and its proprietary script VBA. I do not have a lot of experience creating macros but more over VBA seems clunky and narrow in its scope.
SQL stands for Structured Query Language and it used to extract data from databases. Accessible databases generally exist on private servers, unless you create a database from a CSV file. So unlike Excel and Tableau, SQL is a script with its own unique syntax, and has no GUI of its own. This can feel intimidating for someone without scripting experience, but I find that because of its smaller set of commands and the concentrated focus of its functionality, it is an excellent warm-up to other languages such as Python.
Although there is a section of SQL commands dealing with database table management and cleaning, this is not where it thrives. SQL shines as tool for exploratory data analysis. SQL is powerful in its brevity; it has the ability to answer specific, nuanced questions about a dataset in a few short lines of code. Filter through numeric or string rows using the WHERE command, categorize query results using GROUPBY, use JOIN to connect different tables — just to name a few.
There are no inherent modelling mechanisms, so data retrieved has to exported elsewhere. SQL also has no ways to visualize data in its own script — I usually query my data and export results as a CSV which is then placed into Tableau. Furthermore, SQL lacks internal mechanisms for automation, but certain API’s provide (clunky) options to do so.
Tableau is almost exclusively for data visualization, and does it well. It has a very pleasant GUI and is smooth to navigate. It takes a multitude of file types including Excel spreadsheets and CSVs and allows users to join tables from multiple sources on columns if applicable. Users can easily use the features already present in a table, or they can create their own calculated fields using a simple script that is reminiscent of an Excel formula/SQL query syntax hybrid.
Once all tables have been imported, joined, and all features have been created — stunning data visualizations of all types can be made almost instantaneously by dragging features and clicking a few buttons. Tableau has inherent geo-coding that really simplifies the creation of info-maps (my personal favorite).
Along with stand-alone charts, maps and graphs, data dashboards can be made which allow not only users, but any stakeholder to gain deeper insights by filtering and grouping, cross-referenced though several visuals simultaneously. According to the Tableau website automation for particular functions can be automated through recommended APIs.
Python’s functionality seems limitless. It can be used to so many different situations — going well beyond the confines of data-oriented applications — which only makes it more robust and a better data tool.
From the PANDAS library we have data frames — the Python equivalent to a spreadsheet in Excel or table in SQL. Data cleaning can sometimes feel like you’re in the dark because you can’t see your entire data frame unless you specify initial settings to do so. Changing a single value can feel tedious as there is a lot of syntax to get around compared to clicking on cell and typing, but when this same methodology is applied on a macro scale, it is very convenient. Likewise, combining data frames — using concat, merge, or join — is similar in ease to SQL and Tableau, but exponentially easier than using a VLOOKUP, HLOOKUP, or Match/Index.
Python, by default, includes tons of math and statistic focused libraries, such as NumPy and scikit-learn, but also includes libraries for auxiliary functions like web-scraping. Additional user generated libraries can be added-in easily. The freedom that users have to create their own functions, store those functions in bespoke classes and libraries is very liberating. These capabilities maximize the extent of customization, yet its functionality can go further, as users can even create custom scripts that will automate a series of specific tasks — embodying the essence of a computers purpose.
Seaborn and MatPlotLib are just two Python libraries that can be implemented to create to data-visualizations. I won’t say that visualizing data in Python is necessarily as convenient as it is on Tableau, or even Excel — there is the lingering syntax that is dense with brackets, parentheses, commas, and quotes, which certainly time and practice to get the hang of. However, once one begins to wrap their head around how to communicate their needs to Python, data visualization production can be rapid and precise, with a myriad of different formats to choose from. I personally love the ease of creating correlation matrices through Seaborn’s .heatmap(), and its more detailed relative, .pairplot().
Where Python truly wins over Excel, Tableau, and SQL is its ability to create predictive models, and its complementary tools which assistance in model production. Excel can provide some guidance in the modelling stage, but it’s minimal and limited in options. From simple linear regression to neural networks, Python basically has a model for every situation. Furthermore, with tools like GridSearch, Python can tell you the specific hyper-parameters to include to optimize your model, with results that maximize predictive power.
In sum, Excel, SQL, and Tableau are great data tools, but even their combined greatness falls short of Python’s power. Excel is nice because its user friendly and can be used in almost all stages of the data process, but within these steps it comes across as mediocre and frumpy. SQL provides a strong and succinct methodology for exploratory data analysis, but that’s about it.
Likewise Tableau produces elegant visualizations in an intuitive way, but can’t be applied to other parts of the data process. So to reiterate, Python is the best data tool because it provides versatility and thoroughness, and simultaneously can be applied to every step in the data process, start to finish.