Working with big data can enable high-value business capabilities (e.g. recommendation engines, predictive maintenance). Hadoop is a frequent enterprise choice for a big data platform. In this class we will look at on-premise Hadoop platforms used to store structured and unstructured data and to run data science models.
- What is Big Data
- We review why big data is of interest and which features help identify it.
- Why Work with Big Data
- A machine learning demo illustrates what is possible with big data.
- Big Data Terminology
- Learners are introduced to the terms and concepts that make up a big data platform. Hands-on interaction with data sets will help connect these terms to big data integration and data science jobs.
- Open Source and Commercial Platforms
- Review available open source options. Examine analyst reports on commercial big data platforms and review architectures to understand what the leading commercial vendors offer. A group discussion of learner experiences with Hadoop follows.
- Big Data Integration
- Discuss hand coding and the browser-based tools offered by Cloudera and MapR for integrating data into single-node virtual sandboxes. Hands-on exercises will involve batch and real-time integration of structured and unstructured data on both platforms.
- Running in Spark
- Further discussion of why we are interested in Spark (e.g. Spark jobs running up to 100x faster than traditional systems). Hands-on exercises involve a couple of introductory Spark jobs.
At the end of this program, learners will be able to:
- Define Big Data and Hadoop in more detail, and identify situations where Hadoop is an appropriate tool.
- Help select and utilize on-premise Hadoop platforms for big data work.
- Utilize Cloudera and MapR tools for common big data integration.
- Build basic Spark applications.
The Predictive and/or Prescriptive course(s) will help in understanding the potential of Spark modeling; however, they are not required.