Dev Intern Study Notes: Big Data
(Updated 20220514) Sharing the notes from our developer induction session with our interns on Big Data. We are still editing this and adding some diagrams, links and other references. Can use it as a collection of key concepts for now.
Our interns spent a week or so researching into the topic with some starting questions, we met to discuss the topic for an hour or so, and then a week or so to write up the topic. Improving on notes from previous sessions. We shared them here not as a full article, but an online notepad for the topic. Please send us your comments or suggestions on how to improve these articles.
These are the main points we set out for discussion.
- What is big data? How big?
- Why is it important? What are the big implications?
- Find examples of 3–5 big data usage that impact our everyday life.
- What are the common software or tools for big data?
Harsh Introduction: If you can imagine it, it probably isn’t big data. :)
Some credit CERN nuclear physicists for pushing the whole NoSQL popularisation with their need of storing petabytes of information in real time. Now if you need time to think about how many hard disks or servers you need to store this, you are probably not using big data. If you are working at at a typical startup or an ordinary SME, you are not using Big Data.
But still, this is not to put you off from learning about the topic. It’s kind of like saying “the sky is the limit”, used to mean there are no limits, and now try that with Elon Musk.
Of course, data size grows over time, gigabytes or terabytes were consider massive 20 years ago. *joke* A company called Teradata, started in 1979, now looks petty or just as common as my iPhone 1TB. But with this growth in mind, Big Data can still be roughly defined as, if you still think you can host it yourself, you are probably not using it unless you are in the FAAMG gang. (Read this, an article from Facebook in 2009, describing how it had to invent its own solution for storing photos, if you don’t have this kind of problem, you are probably not that big.)
That wasn’t aimed at mocking the many startups out there in the big data space, when they aim to serve many business clients, they are certainly much bigger than typical b2c startups.
Just that most people talk about big data without being able to put some numbers to it. Most solutions will not have billions of users or billions of transactions per day. It’s not that we are not at an exponential age, but most people can’t even handle linear growth properly whilst bs-ting about big data.
How did tiktok grow so quickly? Partly because technologies and methods for dealing with a billion or so users were already available for free, thanks to the previous generations of top tech players driving such innovations.
Big Data has been one of the most often discussed topics in the IT industry in recent years, and it has already had a huge impact on our lives: Amazon recommends products based on users’ previous searches and views; Tiktok uses your views, likes, and comments to predict users’ preferences and recommends the next videos; financial institutions use big data and machine learning models to forecast financial product prices in order to generate revenue, and so on. We frequently hear the term “big data,” but what does it actually mean and why is it so critical?
While giving a 100% accurate definition on what is big data is hard, big data generally just refers to extremely large datasets. A National Institute of Standards and Technology report defined big data as consisting of “extensive datasets — primarily in the characteristics of volume, velocity, and/or variability — that require a scalable architecture for efficient storage, manipulation, and analysis.” IBM’s definition is “data sets whose size or type is beyond the ability of traditional relational databases to capture, manage and process the data with low latency.”
We used to think about data mainly as just structured data with tables and rows in the SQL days. But computer has become so powerful compared to before, we can now process unstructured data efficiently too. The rise of NoSQL helped to drive big data even further.
The 3Vs are still important in processing considering Big Data solutions, just need to stay flexible with the ever growing scale.
- Volume: The huge amounts of data being stored.
- Velocity: The lightning speed at which data streams must be processed and analyzed.
- Variety: The different sources and forms from which data is collected, such as numbers, text, video, images, audio and text.
While is above description give a general glimpse of what big data is, how big is big data? Is there a set number of it? Well, there is no set answer. But generally, traditional data is measured in sizes like megabytes, gigabytes and terabytes, while big data is stored in petabytes and zettabytes. Generally speaking, (unless you work for such big data company) if it is still something that you can still consider hosting in office or at home, it isn’t really big data, and certainly not growing fast enough.
Let’s also learn a little about the different words for big data sizes:
1 Bit = Binary Digit; 8 Bits = 1 Byte; 1000 Bytes = 1 Kilobyte; 1000 Kilobytes = 1 Megabyte; 1000 Megabytes = 1 Gigabyte; 1000 Gigabytes = 1 Terabyte; 1000 Terabytes = 1 Petabyte; 1000 Petabytes = 1 Exabyte; 1000 Exabytes = 1 Zettabyte; 1000 Zettabytes = 1 Yottabyte; 1000 Yottabytes = 1 Brontobyte; 1000 Brontobytes = 1 Geopbyte
There are words in Chinese that represents even larger units:
We could see that those units are really big, compared to the units that we are more familiar with like MB, GB, TB. And the data growth in the past 20–30 years is really substantial. Moore’s Law stated that “the number of transistors that can be packed into a given unit of space will double about every two years.” Today, however, the doubling of installed transistors on silicon chips occurs at a pace faster than every two years. And Moore’s law also word for data capacity: Kryder’s law predicted that the doubling of disk density on one inch of magnetic storage would take place once every thirteen months. Here is the graph from statista.com, showing the increasing in the data volume from 2010:
As you can see from the graph, the increase in data volume is significant. And the use of those data has changed our lives on a lot of aspects. Big data is important because is provides vast opportunities for businesses. It could be used independently or with traditional data, to be the basis of advanced analytics and help make sound business decisions. Some common techniques include data mining, text analytics, predictive analytics, data visualization, AI, machine learning, statistics and natural language processing. With big data analytics, you can ultimately fuel better and faster decision-making, modelling and predicting of future outcomes and enhanced business intelligence.
Common tools like Hadoop, R, Tableau, HPCC, Storm (Apache), CouchDB, etc. are used to help big data analysis. Let’s take a look on the one of the most popular tools for big data analysis: Hadoop.
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. There are 3 steps in the process:
- Storage: HDFS (data is distributed on many computers. There are many blocks, and each block is 128 MB in size). HDFS make copies of data and stored it across multiple systems. (Replication method)
- MapReduce: in the past, all data is process on 1 computer. Hadoop divides data in to different sets and process them on different computers. MapReduce process each part of the data individually and sums result in the end.
- Run on Hadoop.
Besides Hadoop, there are also other technologies that are essential to big data:
Kafka is used for real-time streams of data, to collect big data, or to do real time analysis (or both). Kafka is a transportation mechanism, and it is good at making data moving really fast and on scale for a company. For example, Netflix uses Kafka to apply recommendations in real-time while you are watching TV shows. Uber uses Kafka to gather data on user, cars and trip in real-time to compute and predict demand and compute price. LinkedIn uses Kafka to collect user interaction data in real-time to make connection recommendations. In addition, Kafka also have consensus ordering service that helps hyper ledger to order nodes.
Raft is also a consensus algorithm. Consensus means multiple servers agreeing on same information, something imperative to design fault-tolerant distributed systems. Hyper ledger fabric to shifting from Kafka to Raft for ordering.
ZooKeeper is essentially a service for distributed systems offering a hierarchical key-value store, which is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems, it offers:
- Configuration management
- Self-selection/consensus building
- Internal structure is like a tree
Zookeeper is important for not only Kafka, but also many other distributed system. It looks after the partitioning of these horizontally-scaled systems. The data within Zookeeper is divided across multiple collection of nodes and this is how it achieves its high availability and consistency. In case a node fails, Zookeeper can perform instant failover migration; e.g. if a leader node fails, a new one is selected in real-time by polling within an ensemble. A client connecting to the server can query a different node if the first one fails to respond. You can watch a zNode, and whenever the node changes, you would be notified.
Note that in public chain like Bitcoin blockchain, the ordering is done by the decentralized system and requires PoW, PoS, etc, which creates a mess sometimes. However, in Hyper ledge fabric, the ordering service using Kafka, Raft and Zookeeper, is centralized, so it is faster.
Back to the topic of big data, as mentioned above, zookeeper looks after the partitioning of the horizontally-scaled systems. So when the amount of data gets larger and larger, you will need to scale up in order to handle the large capacity of data. And there are two ways of scaling: vertical scaling and horizontal scaling.
Horizontal scaling means that you scale by adding more machines into your pool of resources whereas Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine. Good examples of horizontal scaling are Cassandra, MongoDB, Google Cloud Spanner .. and a good example of vertical scaling is MySQL — Amazon RDS (The cloud version of MySQL).
Finally, are there any limitations on big data? The answer is yes:
- Over-reliance on big data: big data often leads managers to rely too much on the data and abdicate decision making. Using what the data tells you to inform a considered decision is of course commendable, but blindly following it without question or leaving room for experience and gut instincts can lead to poor decisions.
- Data quality varies
- Cybersecurity risks: Storing big data, particularly sensitive data, can make companies a more attractive target for cyberattackers.
- Compliance with the government law
- Privacy concerns
To sum up, big data has already impact our daily lives and will continue to have great impacts. It has utilized different types of technologies, and having a general understanding of them would be helpful for us when we discuss big data.
Some Database Knowledge:
Database transaction: a transaction is an operation on the database that you want to treat a whole. For example, when A transfer some tokens to B, A would first withdraw some money from its account and then deposit it into B’s account. This is a whole process, and the operations is only successful when it process is fully done. If it stops at any point halfway, it is not good and the money might be lost.
A transaction need to be ACID:
- Atomic — All operations in a transaction succeed or every operation is rolled back.
- Consistent — On the completion of a transaction, the database is structurally sound.
- Isolated — Transactions do not contend with one another. Contentious access to data is moderated by the database so that transactions appear to run sequentially.
- Durable — The results of applying a transaction are permanent, even in the presence of failures.
- Uniquely Identify each row (with a primary key)
- Multiple value columns must be separated, so that there is only one value in each column per row
- Consistent data type must be enforced for each column
- 2NF: Every non-key column in a table must depend on the value of the key
- 3NF: Every non-key column must ONLY depend on the value of the key