Big Data Analysis with Scala and Spark

Provider: Coursera (CC)
Provider rating: 7.2 (average over 6 reviews)


Description

When you enroll in a course through Coursera, you can choose between a paid plan and a free plan:

  • Free plan: audit only, with no certification. You will have access to all course materials except graded items.
  • Paid plan: commit to earning a Certificate, a trusted, shareable way to showcase your new skills.




About this course: Manipulating big data distributed over a cluster using functional concepts is rampant in industry, and is arguably one of the first widespread industrial uses of functional ideas. This is evidenced by the popularity of MapReduce and Hadoop, and most recently Apache Spark, a fast, in-memory distributed collections framework written in Scala. In this course, we'll see how the data-parallel paradigm can be extended to the distributed case, using Spark throughout. We'll cover Spark's programming model in detail, being careful to understand how and when it differs from familiar programming models, like shared-memory parallel collections or sequential Scala collections. Through hands-on examples in Spark and Scala, we'll learn when important issues related to distribution, like latency and network communication, should be considered and how they can be addressed effectively for improved performance.

Learning outcomes. By the end of this course you will be able to:

  • read data from persistent storage and load it into Apache Spark,
  • manipulate data with Spark and Scala,
  • express algorithms for data analysis in a functional style,
  • recognize how to avoid shuffles and recomputation in Spark.

Recommended background: You should have at least one year of programming experience. Proficiency with Java or C# is ideal, but experience with other languages such as C/C++, Python, JavaScript, or Ruby is also sufficient. You should have some familiarity with the command line. This course is intended to be taken after Parallel Programming: https://www.coursera.org/learn/parprog1.
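
As a concrete taste of those learning outcomes, here is a minimal sketch (not course material; the object name and file path are invented) of loading data from persistent storage into Spark and analyzing it in a functional style:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordLengths {
      def main(args: Array[String]): Unit = {
        // Local-mode configuration so the sketch runs on one machine.
        val conf = new SparkConf().setAppName("WordLengths").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // Read data from persistent storage into an RDD.
        // "data/words.txt" is a placeholder path.
        val lines = sc.textFile("data/words.txt")

        // Express the analysis functionally: each step is a transformation.
        val avgLength = lines
          .flatMap(_.split("\\s+"))
          .filter(_.nonEmpty)
          .map(_.length.toDouble)
          .mean() // an action: only here does Spark run the distributed job

        println(s"Average word length: $avgLength")
        sc.stop()
      }
    }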

Created by: École Polytechnique Fédérale de Lausanne (EPFL)
  • Taught by: Dr. Heather Miller, Research Scientist

Basic Info
  • Course 4 of 5 in the Functional Programming in Scala Specialization
  • Language: English
  • How to Pass: Pass all graded assignments to complete the course.
  • User Ratings: 4.6 stars average user rating

Coursework

Each course works like an interactive textbook, with pre-recorded videos, quizzes, and projects.

Help from Your Peers

Connect with thousands of other learners: debate ideas, discuss course material, and get help mastering concepts.

Certificates

Earn official recognition for your work, and share your success with friends, colleagues, and employers.


Syllabus


WEEK 1


Getting Started + Spark Basics



Get up and running with Scala on your computer. Complete an example assignment to familiarize yourself with our unique way of submitting assignments. In this week, we'll bridge the gap between data parallelism in the shared memory scenario (learned in the Parallel Programming course, prerequisite) and the distributed scenario. We'll look at important concerns that arise in distributed systems, like latency and failure. We'll go on to cover the basics of Spark, a functionally-oriented framework for big data processing in Scala. We'll end the first week by exercising what we learned about Spark by immediately getting our hands dirty analyzing a real-world data set.


7 videos, 5 readings


  1. Reading: Tools setup
  2. Reading: Eclipse tutorial
  3. Reading: IntelliJ IDEA Tutorial
  4. Reading: Sbt tutorial
  5. Reading: Submitting solutions
  6. Ungraded Programming: Example
  7. Video: Introduction, Logistics, What You'll Learn
  8. Video: Data-Parallel to Distributed Data-Parallel
  9. Video: Latency
  10. Video: RDDs, Spark's Distributed Collection
  11. Video: RDDs: Transformation and Actions
  12. Video: Evaluation in Spark: Unlike Scala Collections!
  13. Video: Cluster Topology Matters!
  14. Ungraded Programming: Wikipedia

Graded: Wikipedia
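
The Week 1 videos stress that RDDs look like ordinary Scala collections but evaluate lazily. A minimal sketch of the difference, assuming a Spark shell where sc is the provided SparkContext (the numbers are illustrative):

    // Transformations only build a plan; nothing executes yet.
    val nums         = sc.parallelize(1 to 1000)
    val evensSquared = nums.filter(_ % 2 == 0).map(n => n * n)

    // Marking the RDD for caching (also lazy) avoids recomputing it
    // every time an action is invoked on it.
    evensSquared.cache()

    // Unlike Scala collections, evaluation happens only at an action:
    val total = evensSquared.reduce(_ + _) // first action: runs the job
    val first = evensSquared.take(3)       // served from the cache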

WEEK 2


Reduction Operations & Distributed Key-Value Pairs
This week, we'll look at a special kind of RDD called pair RDDs. With this specialized kind of RDD in hand, we'll cover essential operations on large data sets, such as reductions and joins.


4 videos


  1. Video: Reduction Operations
  2. Video: Pair RDDs
  3. Video: Transformations and Actions on Pair RDDs
  4. Video: Joins
  5. Ungraded Programming: StackOverflow (2-week-long assignment)

Graded: StackOverflow (2-week-long assignment)
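
As a sketch of this week's operations (the data is invented; sc is assumed to be a SparkContext, e.g. in the Spark shell):

    // A pair RDD is simply an RDD of key-value tuples.
    val purchases = sc.parallelize(Seq(("alice", 20.0), ("bob", 15.0), ("alice", 5.0)))
    val cities    = sc.parallelize(Seq(("alice", "Lausanne"), ("bob", "Geneva")))

    // reduceByKey combines values per key on each node before any data
    // moves, which is far cheaper than groupByKey plus a manual reduce.
    val totals = purchases.reduceByKey(_ + _) // ("alice", 25.0), ("bob", 15.0)

    // An inner join matches values across two pair RDDs by key.
    val joined = totals.join(cities) // ("alice", (25.0, "Lausanne")), ...
    joined.collect().foreach(println)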

WEEK 3


Partitioning and Shuffling



This week we'll look at some of the performance implications of using operations like joins. Is it possible to get the same result without having to pay for the overhead of moving data over the network? We'll answer this question by delving into how we can partition our data to achieve better data locality, in turn optimizing some of our Spark jobs.


4 videos


  1. Video: Shuffling: What it is and why it's important
  2. Video: Partitioning
  3. Video: Optimizing with Partitioners
  4. Video: Wide vs Narrow Dependencies
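
A minimal sketch of the idea, reusing the invented purchases and cities pair RDDs from the Week 2 example: if both sides of a join share a partitioner, matching keys already live on the same node, and the join becomes a narrow dependency with no network traffic.

    import org.apache.spark.HashPartitioner

    // Partition both RDDs the same way and keep them in memory, so the
    // shuffle happens once rather than on every downstream operation.
    val part       = new HashPartitioner(8)
    val purchasesP = purchases.partitionBy(part).persist()
    val citiesP    = cities.partitionBy(part).persist()

    // Co-partitioned inputs: this join moves no data over the network.
    val joined = purchasesP.join(citiesP)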


WEEK 4


Structured Data: SQL, DataFrames, and Datasets



With our newfound understanding of the cost of data movement in a Spark job, and some experience optimizing jobs for data locality last week, this week we'll focus on how we can more easily achieve similar optimizations. Can structured data help us? We'll look at Spark SQL and its powerful optimizer which uses structure to apply impressive optimizations. We'll move on to cover DataFrames and Datasets, which give us a way to mix RDDs with the powerful automatic optimizations behind Spark SQL.


5 videos


  1. Video: Structured vs Unstructured Data
  2. Video: Spark SQL
  3. Video: DataFrames (1)
  4. Video: DataFrames (2)
  5. Video: Datasets
  6. Ungraded Programming: Time Usage

Graded: Time Usage
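
A short sketch of the structured APIs covered this week, written shell-style for Spark 2.x or later (the SparkSession setup, case class, and data are invented for illustration; in a compiled application the case class must be defined at the top level so Spark can derive an encoder for it):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("StructuredSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Datasets keep the compile-time types of RDDs while handing the
    // query plan to Catalyst, Spark SQL's optimizer.
    case class Person(name: String, age: Int)
    val people = Seq(Person("alice", 29), Person("bob", 31)).toDS()

    // The equivalent untyped DataFrame query; both are optimized alike.
    val adults = people.toDF().filter($"age" >= 30).select($"name")
    adults.show()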