From banking and manufacturing through to education and entertainment, using data science for business has revolutionized almost every sector in the modern world. It has an important role to play in everything from app development to network security.
Taking an interactive approach to learning the fundamentals, this book is ideal for beginners. You'll learn all the best practices and techniques for applying data science in the context of real-world scenarios and examples.
After an introduction to data science and machine learning, you'll get to grips with Jupyter functionality and features. You'll use Python libraries like scikit-learn, pandas, Matplotlib, and Seaborn to perform data analysis and data preprocessing on real-world datasets from within your own Jupyter environment. Progressing through the chapters, you'll train classification models using scikit-learn, and assess model performance using advanced validation techniques. Towards the end, you'll use Jupyter Notebooks to document your research, build stakeholder reports, and even analyze web performance data.
By the end of The Applied Data Science Workshop, Second Edition, you'll be ready to move beyond the beginner level and confidently apply data science techniques and tools to real-world projects.
If you are an aspiring data scientist who wants to build a career in data science or a developer who wants to explore the applications of data science from scratch and analyze data in Jupyter using Python libraries, then this book is for you. Although a brief understanding of Python programming and machine learning is recommended to help you grasp the topics covered in the book more quickly, it is not mandatory.
Chapter 1, Introduction to Jupyter Notebooks, will get you started by explaining how to use the Jupyter Notebook and JupyterLab platforms. After going over the basics, we will discuss some fantastic features of Jupyter, which include tab completion, magic functions, and new additions to the JupyterLab interface. Finally, we will look at the Python libraries we'll be using in this book, such as pandas, seaborn, and scikit-learn.
Chapter 2, Data Exploration with Jupyter, is focused on exploratory analysis in a live Jupyter Notebook environment. Here, you will use visualizations such as scatter plots, histograms, and violin plots to deepen your understanding of the data. We will also walk through some simple modeling problems with scikit-learn.
Chapter 3, Preparing Data for Predictive Modeling, will enable you to plan a machine learning strategy and assess whether or not data is suitable for modeling. In addition to this, you'll learn about the process involved in preparing data for machine learning algorithms, and apply this process to sample datasets using pandas.
Chapter 4, Training Classification Models, will introduce classification algorithms such as SVMs, KNNs, and Random Forests. Using a real-world Human Resources analytics dataset, we'll train and compare models that predict whether an employee will leave their company. You'll learn about training models with scikit-learn and use decision boundary plots to see what overfitting looks like.
Chapter 5, Model Validation and Optimization, will give you hands-on experience with model testing and model selection concepts, including k-fold cross-validation and validation curves. Using these techniques, you'll learn how to optimize model parameters and compare model performance reliably. You will also learn how to implement dimensionality reduction techniques such as Principal Component Analysis (PCA).
Chapter 6, Web Scraping with Jupyter Notebooks, will focus on data acquisition from online sources such as web pages and APIs. You will see how data can be downloaded from the web using HTTP requests and HTML parsing. After collecting data in this way, you'll also revisit concepts learned in earlier chapters, such as data processing, analysis, visualization, and modeling.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"It's recommended to install some of these (such as mlxtend, watermark, and graphviz) ahead of time if you have access to an internet connection now. This can be done by opening a new Terminal window and running the pip or conda commands."
Words that you see on the screen (for example, in menus or dialog boxes) appear in the same format.
A block of code is set as follows:
pip install mlxtend
New terms and important words are shown like this:
"The focus of this chapter is to introduce Jupyter Notebooks—the data science tool that we will be using throughout the book."
Lines of code that span multiple lines are split using a backslash ( \ ). When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.
history = model.fit(X, y, epochs=100, batch_size=5, verbose=1, \
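To illustrate the continuation rule with something that runs without any extra libraries, here is a minimal sketch. The function describe_run and its arguments are hypothetical and are not taken from the book's code. The sketch only shows that a call split with a backslash behaves the same as one written on a single line, and that inside parentheses Python also joins the lines without a backslash.

```python
# Hypothetical helper used only to demonstrate line continuation.
def describe_run(epochs, batch_size, verbose):
    return f"epochs={epochs}, batch_size={batch_size}, verbose={verbose}"

# One logical line written on a single physical line.
single_line = describe_run(epochs=100, batch_size=5, verbose=1)

# The same call split with a backslash, following the convention above.
with_backslash = describe_run(epochs=100, batch_size=5, \
                              verbose=1)

# Inside parentheses the backslash is optional; Python joins the lines anyway.
implicit = describe_run(epochs=100, batch_size=5,
                        verbose=1)

print(single_line == with_backslash == implicit)
```

Note that nothing, not even a space, may follow the backslash on its line; otherwise Python reports a syntax error.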
Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:
# Print the sizes of the dataset
print("Number of Examples in the Dataset = ", X.shape[0])
print("Number of Features for each example = ", X.shape[1])
Multi-line comments are enclosed by triple quotes, as shown below:
"""
Define a seed for the random number generator to ensure the
result will be reproducible
"""
seed = 1
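As a rough sketch of why a seed is defined at all, the example below uses only Python's built-in random module, which is an assumption made for illustration; the book's own code may seed other libraries instead. The helper draw_sample is hypothetical. Re-seeding with the same value before each run yields the same sequence of numbers.

```python
import random

"""
Define a seed for the random number generator to ensure the
result will be reproducible
"""
seed = 1


def draw_sample(seed, n=5):
    # Re-seeding resets the generator to the same starting state.
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw_sample(seed)
second = draw_sample(seed)

# Both runs used the same seed, so the two lists are identical.
print(first == second)
```

Strictly speaking, a triple-quoted block is a string literal that Python evaluates and discards, which is why it can serve as a multi-line comment.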
Before we explore the book in detail, we need to set up specific software and tools. In the following section, we shall see how to do that.
The easiest way to get up and running with this workshop is to install the Anaconda Python distribution. This can be done as follows:
- Navigate to the Anaconda downloads page from https://www.anaconda.com/.
- Download the most recent Python 3 distribution for your operating system – currently, the most stable version is Python 3.7.
- Open and run the installation package. If prompted, select yes for the option to Register Anaconda as my default Python.
pip comes pre-installed with Anaconda. Once Anaconda is installed on your machine, all the required libraries can be installed using pip, for example, pip install numpy. Alternatively, you can install all the required libraries using pip install -r requirements.txt. You can find the requirements.txt file at https://packt.live/2YBPK5y.
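After installing, one way to confirm that the libraries are visible to Python is a quick availability check. The sketch below is only an illustration: the list of import names is an assumption based on the libraries mentioned in this preface (requirements.txt remains the authoritative list), and find_missing is a hypothetical helper.

```python
import importlib.util

# Import names of libraries mentioned in this preface. This list is an
# assumption for illustration; requirements.txt is the authoritative source.
# Note that scikit-learn is imported under the name "sklearn".
LIBRARIES = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn"]


def find_missing(import_names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


missing = find_missing(LIBRARIES)
if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All listed libraries are available.")
```

Any name reported as not found can then be installed with pip or conda as described above.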
The exercises and activities will be executed in Jupyter Notebooks. Jupyter is a Python library and can be installed in the same way as the other Python libraries, that is, with pip install jupyter. Fortunately, it comes pre-installed with Anaconda. To open a notebook, simply run the command jupyter notebook in the Terminal or Command Prompt.
You'll be working on different exercises and activities using either the JupyterLab or Jupyter Notebook platforms. These exercises and activities can be downloaded from the associated GitHub repository.
Download the repository from https://packt.live/2zwhfom.
You can either clone it using git or download it as a zipped folder by clicking on the green Clone or download button in the upper-right corner.
In order to launch a Jupyter Notebook workbook, you should first use the Terminal to navigate to your source code. See the following, for example:
Once you are in the project directory, simply run jupyter lab to start up JupyterLab. Similarly, for Jupyter Notebook, run jupyter notebook.
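The launch steps can be sketched as a small helper that builds, but does not run, the command for either platform. The directory name used here is hypothetical and stands in for wherever you saved the repository; launch_command is likewise an illustrative name, not part of Jupyter.

```python
import shutil
from pathlib import Path


def launch_command(project_dir, app="lab"):
    """Build (but do not run) the command that starts Jupyter in project_dir."""
    if app not in ("lab", "notebook"):
        raise ValueError("app must be 'lab' or 'notebook'")
    return f'cd "{Path(project_dir)}" && jupyter {app}'


# Hypothetical location of the downloaded repository.
project_dir = Path.home() / "applied-data-science-workshop"

print(launch_command(project_dir, "lab"))
print(launch_command(project_dir, "notebook"))

# shutil.which returns None when the jupyter executable is not on the PATH.
print("jupyter found on PATH:", shutil.which("jupyter") is not None)
```

If the last line prints False, the Jupyter installation is probably not active in the current Terminal session, for example because the Anaconda environment has not been activated.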
You can find the complete code files of this book at https://packt.live/2zwhfom. You can also run many activities and exercises directly in your web browser by using the interactive lab environment at https://packt.live/3d6yr1A.
We've tried to support interactive versions of all activities and exercises, but we recommend a local installation as well for instances where this support isn't available.
If you have any issues or questions about installation, please email us at firstname.lastname@example.org.