Machine Learning for Big Data (EN)

course

Pumpedu s.r.o.

You will be guided by Mojmír Vinkler

Information

Description

This course presents an overview of the tools and concepts used for machine learning on big data.

After completing the course, participants should be able to choose the right tool for a given problem, recognize when a simpler solution exists, and avoid common mistakes. Special attention will be given to Spark as a universal tool that can be used for both big data processing and machine learning.

Overview of Big Data concepts and tools
- From small to big data and estimating its value
- Row-oriented vs column-oriented databases
- HDFS (Hadoop Distributed File System)
- Big data file formats – Parquet, ORC, Avro
- Compression – gzip, snappy, zstd
- SQL databases – BigQuery, Redshift, ClickHouse, Snowflake, Vertica
A practical example of big data value proposition
Introduction to Spark
- MapReduce
- Spark Computing Engine and RDDs (Resilient Distributed Datasets)
- DataFrames
- Spark Ecosystem
- Most common Spark mistakes
- How to run Spark
- Alternatives – Apache Beam (Dataflow), Dask, lambdas
A practical example with Spark
ML strategies for Big Data
- Incremental learning
- Batch learning for neural networks
- Distributed training
- Federated learning
- Alternative strategies
  - Random sampling
  - Submodels
  - Larger workstation
Frameworks
- Scikit-learn with partial_fit
- MLlib
- Dask-ML
Practical examples with various frameworks
Common mistakes

Prerequisites

Basics of Python and working in Google Colab
Basics of machine learning at the level of our course Introduction to Machine Learning
