Welcome to the book! When reading the table of contents, you probably noticed the diversity of the topics we’re about to cover. The goal of Introducing Data Science is to provide you with a little bit of everything, enough to get you started. Data science is a very wide field, so wide indeed that a book ten times the size of this one wouldn’t be able to cover it all. For each chapter, we picked a different aspect we find interesting. Some hard decisions had to be made to keep this book from collapsing your bookshelf!
We hope it serves as an entry point—your doorway into the exciting world of data science.
Roadmap
Chapters 1 and 2 offer the general theoretical background and framework necessary to understand the rest of this book:
■ Chapter 1 is an introduction to data science and big data, ending with a practical example of Hadoop.
■ Chapter 2 is all about the data science process, covering the steps present in almost every data science project.
In chapters 3 through 5, we apply machine learning on increasingly large data sets:
■ Chapter 3 keeps it small. The data still fits easily into an average computer’s memory.
■ Chapter 4 increases the challenge by looking at “large data.” This data fits on your machine, but fitting it into RAM is hard, making it a challenge to process without a computing cluster.
■ Chapter 5 finally looks at big data. For this we can’t get around working with multiple computers.
Chapters 6 through 9 touch on several interesting subjects in data science in a more-or-less independent manner:
■ Chapter 6 looks at NoSQL and how it differs from relational databases.
■ Chapter 7 applies data science to streaming data. Here the main problem is not size, but rather the speed at which data is generated and old data becomes obsolete.
■ Chapter 8 is all about text mining. Not all data starts off as numbers. Text mining and text analytics become important when the data is in textual formats such as emails, blogs, websites, and so on.
■ Chapter 9 focuses on the last part of the data science process (data visualization and prototype application building) by introducing a few useful HTML5 tools.
Appendixes A–D cover the installation and setup of the Elasticsearch, Neo4j, and MySQL databases described in the chapters and of Anaconda, a Python distribution that's especially useful for data science.
Whom this book is for
This book is an introduction to the field of data science. Seasoned data scientists will see that we only scratch the surface of some topics. For our other readers, there are some prerequisites for you to fully enjoy the book. A minimal understanding of SQL, Python, HTML5, and statistics or machine learning is recommended before you dive into the practical examples.
Code conventions and downloads
We opted to use Python for the practical examples in this book. Over the past decade, Python has developed into a much respected and widely used data science language.
The code itself is presented in a fixed-width font like this to separate it from ordinary text. Code annotations accompany many of the listings, highlighting important concepts.
The book contains many code examples, most of which are available in the online code base, which can be found at the book’s website, https://www.manning.com/books/introducing-data-science.
about the authors
DAVY CIELEN is an experienced entrepreneur, book author, and professor. He is the co-owner with Arno and Mo of Optimately and Maiton, two data science companies based in Belgium and the UK, respectively, and co-owner of a third data science company based in Somaliland. The main focus of these companies is on strategic big data science, and they are occasionally consulted by many large companies. Davy is an adjunct professor at the IESEG School of Management in Lille, France, where he is involved in teaching and research in the field of big data science.
ARNO MEYSMAN is a driven entrepreneur and data scientist. He is the co-owner with Davy and Mo of Optimately and Maiton, two data science companies based in Belgium and the UK, respectively, and co-owner of a third data science company based in Somaliland. The main focus of these companies is on strategic big data science, and they are occasionally consulted by many large companies. Arno is a data scientist with a wide spectrum of interests, ranging from medical analysis to retail to game analytics. He believes insights from data combined with some imagination can go a long way toward helping us to improve this world.