Big Data
- “Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
- Ranges from a few dozen terabytes to multiple petabytes (thousands of terabytes)
- A dataset can also be smaller in volume but higher in complexity.
- The trend towards "Big Data" is caused by a host of developments:
- The creation and storage of large datasets has become feasible and economically viable.
- Technical advances, for example in multi-core systems and cloud computing, make it possible.
- Such amounts of data are now created in many areas of life (e.g. sensor data).
- Big data: The next frontier for innovation, competition, and productivity
Hardware limitations
- The CPU is the brain of a computer:
- It directs the other components and performs the actual calculations.
- Registers hold data that the CPU is working with at the moment (e.g. cumulative sum)
- Registers avoid having to send data back and forth between memory (RAM) and the CPU.
- A 2.5 GHz CPU consuming 8 bytes per operation could process 2.5 billion × 8 = 20 billion bytes per second.
- Most of the time, the CPU sits idle, waiting for input data from RAM.
- Accessing RAM takes ~200x longer than a CPU operation:
- Known to be "efficient, expensive, and ephemeral (volatile)"
- Data stored in RAM gets erased once the computer shuts down.
- RAM is relatively expensive.
- Magnetic disks can be 200x slower and SSDs can be 15x slower than RAM.
- Network transfer takes ~20x longer than reading from an SSD.
- Transferring data across a network is the biggest bottleneck when working with big data.
- Distributed systems try to minimize shuffling data back and forth between nodes.
- Latency Numbers Every Programmer Should Know
- Even if the entire dataset cannot fit into RAM, it can still be processed chunk-wise (see the sketch below).
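A minimal sketch of chunk-wise processing in Python, assuming a hypothetical CSV file sales.csv with an amount column; only a small running aggregate is ever held in RAM:

```python
import pandas as pd

# Read the file in fixed-size chunks instead of loading it whole.
# File name and column name are illustrative, not from the source.
total = 0.0
for chunk in pd.read_csv("sales.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()  # only the running sum stays in memory

print(f"Total sales: {total}")
```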
The five V's
- Volume:
- The sheer volume of data that is produced each day (petabytes, exabytes, zettabytes)
- It can no longer be stored or analyzed using conventional data processing methods.
- Velocity:
- Speed with which the data is generated, analyzed and reprocessed.
- Variety:
- Diversity of data types and data sources.
- An estimated 80% of the data in the world today is unstructured.
- Veracity:
- “garbage in, garbage out”
- For big data systems to be reliable and usable, the data also has to be accurate.
- Value:
- Added value for companies.
- It's a question of generating business value from their investments.
Data Engineering
- Companies today need to be able to reliably store, process, and query their huge data inflows.
- As a result, the data infrastructure needs to be distributed, scalable (petabytes) and reliable.
- Data engineering thinks about the end-to-end process as “data pipelines” (a minimal sketch follows below).
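A toy end-to-end pipeline sketch in Python to illustrate the extract-transform-load shape such pipelines typically take; the records and the warehouse.db store are made up, with SQLite standing in for a real data store:

```python
import sqlite3

def extract():
    # In practice: pull from APIs, logs, or an operational database.
    return [{"item": "book", "amount": "12.50"},
            {"item": "pen", "amount": None}]

def transform(rows):
    # Cleaning step: drop incomplete records, cast types.
    return [(r["item"], float(r["amount"]))
            for r in rows if r["amount"] is not None]

def load(rows):
    # Load the cleaned rows into the store.
    con = sqlite3.connect("warehouse.db")
    con.execute("CREATE TABLE IF NOT EXISTS sales (item TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract()))
```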
Data engineers
- Data engineers are highly specialized software engineers who create and maintain robust big data pipelines.
- Together with data scientists who analyze the data, they form the basis of the data teams.
- Data engineers are the experts on whom data scientists depend in order to do their work.
- They are in charge of pulling, cleaning and loading the data into data stores.
- Over the years though, terminology and roles have become harder to separate.
- Data engineering is much closer to software engineering than it is to data science.
- Data engineers are one of the most in-demand job roles at today’s leading companies.
- The tool set of a data engineer includes:
- Hadoop, Spark, Python, Scala, Java, C++, SQL, AWS/Redshift, Azure
- To stay marketable, keeping up to date is more important than ever.
- A Turbulent Year: The 2019 Data & AI Landscape
- The Rise of the Data Engineer
- In smaller environments:
- Data engineers usually set up and operate platforms like Hadoop/Hive/HBase, Spark, and the like.
- Use hosted services offered by Amazon or Databricks.
- Get support from companies like Cloudera or Hortonworks.
Evolution of data engineering
- Data Engineering Introduction and Epochs
- Hadoop:
- With Hadoop open-sourced in 2006, it became easier and cheaper to store large amounts of data.
- Hadoop (unlike traditional RDBMS databases) did not require the data to be structured.
- The need to develop MapReduce jobs in Java drove the emergence of Backend Engineers.
- That changed when Hive was open-sourced in 2010.
- Data orchestration engines:
- Companies faced the challenge of operating complex data flows without any tooling.
- Spotify open-sourced Luigi in 2012.
- Airbnb open-sourced Airflow in 2015 (inspired by a similar system at Facebook).
- Given the traction of the PyData ecosystem, most orchestration engines were built with Python (a minimal Airflow DAG sketch follows at the end of this list).
- Python has become the favorite programming language for Production Engineers.
- Machine learning:
- With enormous growth in data, machine learning quickly gained traction.
- Until the advent of Hadoop, ML models were usually trained on a single machine.
- In the early days of Hadoop, ML models required advanced software engineering knowledge (e.g. frameworks such as Mahout on top of MapReduce).
- Many Backend Engineers have become Machine Learning Engineers.
- Advancements in sklearn and orchestration engines made production-ready workflows easier.
- Spark and real-time processing:
- Spark’s MLlib (2014) democratized ML computation on big data.
- Spark further offered a way for data engineers to easily process streaming data.
- Cloud development and serverless architecture:
- AWS was officially launched in 2006.
- Elastic MapReduce (2009) made it easier to dynamically spin up and scale Hadoop clusters.
- The cloud made storage and compute essentially infinite.
- Elasticity of the cloud made it much easier to handle high peak batch jobs.
- But it came at the cost of having to manage infrastructure and the scaling process through code.
- AWS Lambda (2014) kicked off the serverless movement.
- Data now could be easily ingested without managing infrastructure.
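A minimal Airflow DAG sketch (referenced above under data orchestration engines), assuming Airflow 2.x; the dag_id and task bodies are hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task bodies; real tasks would pull, clean, and load data.
def extract():
    print("pull raw data")

def transform():
    print("clean and aggregate")

def load():
    print("write to the warehouse")

with DAG(
    dag_id="daily_sales_pipeline",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",     # one run per day
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Declare the dependency chain: extract, then transform, then load.
    t_extract >> t_transform >> t_load
```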
Designing systems
Thinking about requirements
- Always approach the problem from the standpoint of your end user.
- Start with the end user's needs and work backwards towards where the data is coming from.
- Sometimes you need to meet in the middle.
- What sort of access patterns do you anticipate from your end users?
- Analytical queries that span large time ranges?
- Huge amounts of small transactions for very specific rows of data?
- What availability do these users demand?
- How quickly does the system need to respond?
- Milliseconds? Consider HBase or Cassandra.
- What consistency do these users demand?
- Just how big is your big data?
- Do you really need a cluster?
- Timeliness:
- Can queries be based on day-old or minute-old data? Then scheduled batch jobs suffice.
- Or must they be (near) real-time? Consider Spark Streaming, Storm, or Flink with Kafka or Flume.
- How much internal infrastructure and expertise is available?
- Do systems you already know fit the bill?
- Should you use proprietary solutions or cloud technology?
- Think carefully, since migrating later on will be difficult.
- If relaxing a "requirement" saves money and time, at least ask the manager.
- Does your organization have existing components you can use?
- What's the least amount of infrastructure you need to build?
- Don't build a new DWH if you already have one.
- Rebuilding an existing solution may have negative business value.
- What about data retention?
- Do you need to keep the data around forever, for example, for auditing?
- Or do you need to purge it often, for example, for privacy?
- What about security?
- Make sure you're in compliance with any regulations in the countries you'll operate in.
Example
- A system to track and display the top 10 best-selling items on an e-commerce website.
- What are the requirements? Work backwards!
- There are millions of end users, generating thousands of queries per second.
- Page latency is important so it must be fast.
- Some distributed NoSQL solution would fit here (such as Apache Cassandra)
- Access pattern is simple: "Give me the current top N sellers in category X"
- Hourly updates are probably good enough.
- Consistency is not important.
- Must be highly available (customers don't like broken websites)
- How does the data get into Cassandra?
- Apache Spark can talk to Cassandra.
- And Spark can add things up over time windows (see the sketch at the end of this example).
- How does the data get into Spark?
- Kafka or Flume - either work.
- Flume is purpose-built for HDFS, which you may or may not need.
- But Flume is also purpose-built for log ingestion.
- A Log4j interceptor may already be available on the servers that process purchases.
- Maybe you already have an existing purchase database.
- Instead of streaming, hourly batch jobs may also do the trick.
- Purchase data is sensitive - get a security review.
- Blasting around raw logs that include PII is a bad idea.
- Strip out data you don't need at the source.
- Some database or publisher may be involved where PII is scrubbed.
- Security considerations may even force you into a totally different design.
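A hedged sketch of the Kafka → Spark → Cassandra flow described in this example, using PySpark Structured Streaming; the broker address, topic, message schema, keyspace, and table names are all made up, and the spark-cassandra-connector package is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("top-sellers").getOrCreate()

# Ingest purchase events from Kafka (broker and topic are hypothetical).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "purchases")
       .load())

# Assume each message value is JSON like
# {"category": ..., "item_id": ..., "ts": ...}.
schema = "category STRING, item_id STRING, ts TIMESTAMP"
events = (raw
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Count sales per item over one-hour event-time windows.
counts = (events
          .withWatermark("ts", "1 hour")
          .groupBy(F.window("ts", "1 hour"), "category", "item_id")
          .count())

def write_top10(batch_df, batch_id):
    # Rank items within each window and category, keep the top 10,
    # then append the result to Cassandra (keyspace/table are made up).
    w = Window.partitionBy("window", "category").orderBy(F.desc("count"))
    top10 = (batch_df
             .withColumn("rank", F.row_number().over(w))
             .filter("rank <= 10")
             .select(F.col("window.start").alias("window_start"),
                     "category", "item_id", "count", "rank"))
    (top10.write
     .format("org.apache.spark.sql.cassandra")
     .mode("append")
     .options(keyspace="shop", table="top_sellers")
     .save())

query = (counts.writeStream
         .outputMode("update")
         .foreachBatch(write_top10)
         .start())
query.awaitTermination()
```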