Introduction to Python for Data Analysis

Recall that R is a statistical programming language: a language designed to do things like t-tests, regression, and so on. R first appeared in the early 1990s as an implementation of the S language, which was developed during the 1970s, and since then many libraries (such as the Tidyverse for data manipulation) have been developed to greatly extend the functionality of the language.

Python, on the other hand, is a general-purpose programming language. It can be used to create just about any kind of software, including Windows-based applications. However, in recent years the open source community has developed increasingly sophisticated data manipulation, statistical analysis, and machine learning libraries for Python. We are now at the point where R and Python are roughly comparable in functionality.

My guess is that Python will eventually supersede R for most data manipulation and analysis tasks. The underlying Python language is modern and clean, and much of the syntactic weirdness of R and the Tidyverse is missing from Python. There are notable exceptions, of course: some Python libraries, like statsmodels, were designed specifically to use R-like syntax.

Format of these tutorials

Recall that the approach in this course is to do the same task using different tools (Excel, SAS Enterprise Guide, R). At this point, you should have a good understanding of the underlying statistics and should be able to focus on the language. Accordingly, each lesson runs through the material already covered in Excel, SAS Enterprise Guide, and R, and simply provides a few examples of how Python (and its libraries) can be used to achieve similar outcomes.

Learning basic Python and Jupyter notebooks

This tutorial is not meant as an introduction to Python. For that, you should search elsewhere on the web or watch an introduction like “Python Essential Training” on LinkedIn Learning (formerly Lynda.com). Note that the LinkedIn coverage of Python is vast. There are courses called “X Essential Training” for just about any topic X in the Python ecosystem (e.g., Pandas Essential Training, Python for Data Science Essential Training, and so on).

Recall that writing scripts (short programs) in R is much easier if you have a development environment like RStudio. Similarly, writing Python is much easier using an interactive notebook tool like Jupyter. Most of the LinkedIn tutorials start by getting you up and running with Jupyter from Anaconda.

This gets a bit confusing:

  1. Python is the programming language

  2. Jupyter Notebook is the environment for writing and executing Python interactively (one or a few lines at a time)

  3. Anaconda is one of the distribution packages that provides Python, some standard Python libraries, Jupyter, and a bunch of other stuff.
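
If it helps to see the three pieces from the inside, the following is a minimal sketch (not part of the original tutorial) that asks the running Python interpreter where it lives and whether a few packages can be found. The function name `describe_environment` and the list of packages checked are assumptions chosen for illustration; an Anaconda install would typically report all three as available, but that depends on your setup.

```python
import importlib.util
import sys


def describe_environment(packages=("notebook", "jupyterlab", "pandas")):
    """Return a small report on the running interpreter and a few packages.

    The package names are illustrative assumptions; swap in whatever you need.
    """
    report = {
        # The language itself: which Python version is executing this code.
        "python_version": sys.version.split()[0],
        # The distribution: where the interpreter was installed.
        "interpreter_path": sys.executable,
    }
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the interpreter path points inside an Anaconda folder, the Python you are running is the one Anaconda installed.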

The Jupyter notebook interface is very simple: it is a web page with interactive cells in which you type short snippets of Python. You then hit Shift-Enter to run the code and the results are shown immediately below. You can also enter formatted text (written in a simple markup called “Markdown”) to document what you are doing or even write an entire document. This tutorial is written as a Jupyter notebook. The notebook metaphor is attractive: you write some notes to yourself, execute some code, generate some graphics, and everything is in one place, just like a physical notebook. The difference is that you can do some crazy-powerful things in an interactive Python notebook. And, of course, you can share your notebooks with others, so that they can use and build on what you have done.
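
To give a feel for what a "short snippet" in a cell looks like, here is a minimal sketch using only Python's built-in `statistics` module. The scores are made-up values for illustration, not data from this course. Typing this into a cell and pressing Shift-Enter would print the summary line directly below the cell.

```python
import statistics

# Hypothetical sample: exam scores invented for illustration.
scores = [72, 85, 90, 66, 78]

mean_score = statistics.mean(scores)
sd_score = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)

print(f"n = {len(scores)}, mean = {mean_score:.1f}, sd = {sd_score:.2f}")
```

In a notebook, the value of the last expression in a cell is also displayed automatically, so ending a cell with just `mean_score` would show the number without a `print` call.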

Jupyter quick start

As noted above, if you are interested in using Python and Jupyter, you should spend a few moments going through one of the many excellent introductory tutorials out there. But if you absolutely need to get started without knowing much…

The only real trick to getting started with Jupyter is knowing where to save the notebook files you create. For me, the easiest procedure is to go back to the old command-line days:

  1. Start a command line prompt

  2. Use operating system commands (cd) to navigate to the directory I want to start in

  3. Start Jupyter notebooks from that location

The specifics of this process depend a bit on your operating system and how you like to organize your files. But here is how I do it:

  1. (Download and install Anaconda, accepting the installation defaults)

  2. Start the program called “Anaconda Prompt” or “Anaconda PowerShell Prompt”. This transports me back in time to 1988 by opening an unremarkable terminal window with a flashing cursor.

  3. Navigate to my preferred working directory (e.g. on Windows: cd "C:\Users\Michael Brydon\Documents\....\Notebooks")

  4. Start Jupyter notebooks by typing the program name: jupyter notebook

  5. From the home page, click “New” and then “Python 3” to create a new Python (version 3) notebook.
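
Once a notebook is open, a quick way to confirm that Jupyter started where you intended is to ask Python for its working directory from the first cell. The sketch below is an illustration only; the helper name `list_notebooks` is hypothetical. It relies on the fact that a notebook's code normally runs with the notebook's own folder as the working directory.

```python
from pathlib import Path


def list_notebooks(folder=None):
    """Return the sorted names of .ipynb files in folder (default: current directory)."""
    folder = Path(folder) if folder is not None else Path.cwd()
    return sorted(p.name for p in folder.glob("*.ipynb"))


# Where new notebooks created from this session will be saved.
print("Working directory:", Path.cwd())

# Notebook files already sitting in that folder.
print("Notebooks here:", list_notebooks())
```

If the directory printed is not the one you expected, close Jupyter, `cd` to the right folder in the Anaconda prompt, and start it again.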