Spark: Big Data Cluster Computing in Production

A book by Ilya Ganelin, Ema Orhian, Kai Sasaki, Brennon York, ISBN

You should understand the basics of development and usage atop Apache Spark. This book does not cover introductory material. Numerous books, forums, and other resources cover that ground, so we assume all readers have basic Spark knowledge or, if lost, will read up on the relevant topics to better understand the material presented in this book.




About the Authors

Ilya Ganelin is a roboticist turned data engineer. After a few years at the University of Michigan building self‐discovering robots and another few years working on embedded DSP software with cell phones and radios at Boeing, he landed in the world of Big Data at the Capital One Data Innovation Lab. Ilya is an active contributor to the core components of Apache Spark and a committer to Apache Apex, with the goal of learning what it takes to build a next‐generation distributed computing platform. Ilya is an avid bread maker and cook, skier, and race‐car driver.

Ema Orhian is a passionate Big Data Engineer interested in scaling algorithms. She is actively involved in the Big Data community, organizing and speaking at conferences, and contributing to open source projects. She is the main committer on jaws‐spark‐sql‐rest, a data warehouse explorer on top of Spark SQL. Ema has been working on bringing Big Data analytics into healthcare, developing an end‐to‐end pipeline for computing statistical metrics on top of large datasets.

Kai Sasaki is a Japanese software engineer who is interested in distributed computing and machine learning. Although the beginning of his career didn’t start with Hadoop or Spark, his original interest toward middleware and fundamental technologies that support a lot of these services and the Internet drives him toward this field. He has been a Spark contributor who develops mainly MLlib and ML libraries. Nowadays, he is trying to research the great potential of combining deep learning and Big Data. He believes that Spark can play a significant role even in artificial intelligence in the Big Data era. GitHub:

Brennon York is an aerobatic pilot moonlighting as a computer scientist. His true loves are distributed computing, scalable architectures, and programming languages. He has been a core contributor to Apache Spark since 2014 with the goal of developing a stronger community and inspiring collaboration through development on GraphX and the core build environment. He has had a relationship with Spark since his contributions began and has been taking applications into production with the framework since that time.

About the Technical Editors

Ted Yu is a Staff Engineer at HortonWorks. He is also an HBase PMC and Spark contributor and has been using/contributing to Spark for more than one year.

Dan Osipov is a Principal Consultant at Applicative, LLC. He has been working with Spark for the last two years, and has been working in Scala for about four years, primarily with data tools and applications. Previously he was involved in mobile development and content management systems.

Jeff Thompson is a neuro‐scientist turned data scientist with a PhD from UC Berkeley in vision science (primarily neuroscience and brain imaging), and a post‐doc at Boston University’s bio‐medical imaging center. He has spent a few years working at a homeland security startup as an algorithms engineer building next‐gen cargo screening systems. For the last two years he has been a senior data scientist at Bosch, a global engineering and manufacturing company.

Anant Asthana is a Big Data consultant and Data Scientist at Pythian. He has a background in device drivers and high availability/critical load database systems.

Bernardo Palacio Gomez is a Consulting Member of the Technical Staff at Oracle on the Big Data Cloud Service Team.

Gaspar Munoz works for Stratio as a product architect. Stratio was the first Big Data platform based on Spark, so he has worked with Spark since it was in the incubator. He has put into production several projects using Spark core, Streaming, and SQL for some of the most important banks in Spain. He has also contributed to Spark and the spark‐csv projects.

Brian Gawalt received a Ph.D. in electrical engineering from UC Berkeley in 2012. Since then he has been working in Silicon Valley as a data scientist, specializing in machine learning over large datasets.

Adamos Loizou is a Java/Scala Developer at OVO Energy.