Big Data Essentials: What Developers Need to Know

Edward Tsang
14 min read · Sep 2, 2021


CC BY-SA 3.0 Original Link

Sharing notes from our Big Data study sessions with new developers. We run these induction sessions when new coders join our team, covering key tech topics all coders should know, like AI, Blockchain, Cloud and Big Data, as well as other hot topics like Crypto, Metaverse, NFT, etc. They spend a week researching the topic with some starting questions, then meet up to discuss it for an hour or so, followed by a week or so to write it up in their own words. We will update and edit these notes every now and then, and we welcome your feedback and comments. Thanks.

About Us: We are the OnDemandWorld team, working on new blockchain and AI app solutions: Niftnack NFT and TechHub Jobs. The MVP is already on the App Store and Google Play, and we are building the first scalable version now. We are looking for new junior devs, blockchain devs, full-stack devs, etc. Contact us for more information. Remote working mostly in 2022–23 unless you are in Shanghai already.

(First week, aim at around 13 hours or so in total, to research into the topic. Try to answer the following questions. Jot down notes for the meeting. 5–10 Pages.)

Tech Topic 3: Big Data

  • What is big data? How big?
  • Why is it important? What are the big implications?
  • Find examples of 3–5 big data usage that impact our everyday life.
  • What are the common software or tools for big data?
  • How to become a data scientist?

(And here are our notes and summary to the questions above, covered in about an hour or so of discussion. Still editing these. Feel free to leave us with comments and suggestions, thanks.)

Harsh Introduction: If you can imagine it, it probably isn’t big data. :)

Big Data has been one of the most discussed topics in the IT industry in recent years, and it has already had a huge impact on our lives: Amazon recommends products based on users’ previous searches and views; TikTok uses your views, likes, and comments to predict your preferences and recommend the next videos; financial institutions use big data and machine learning models to forecast financial product prices in order to generate revenue, and so on. We frequently hear the term “big data,” but what does it actually mean and why is it so critical?

Some credit CERN nuclear physicists with popularising NoSQL through their need to store petabytes of information in real time. If you need time to think about how many hard disks or servers you need to store this, you are probably not using big data. If you are working at a typical startup or an ordinary SME, you are not using Big Data.

Still, this is not meant to put you off learning about the topic. It’s kind of like saying “the sky is the limit”: it used to mean there are no limits, and now try saying that to Elon Musk.

What is big data? How big?

Big data is a term used to describe extremely large and complex datasets that are difficult to process using traditional data processing techniques. These datasets are often generated from various sources such as social media, internet search engines, sensors, and other types of digital devices.

The size of big data can vary depending on the context, but it generally refers to datasets that are too large to be processed by traditional database management systems or analytical tools. Big data is typically characterized by its volume, velocity, and variety, which are commonly referred to as the “3Vs” of big data. A National Institute of Standards and Technology report defined big data as consisting of “extensive datasets — primarily in the characteristics of volume, velocity, and/or variability — that require a scalable architecture for efficient storage, manipulation, and analysis.” IBM’s definition is “data sets whose size or type is beyond the ability of traditional relational databases to capture, manage and process the data with low latency.”

The volume of big data can range from terabytes to petabytes or even exabytes, which represents an enormous amount of data that is difficult to store, manage, and analyze using traditional methods. The velocity of big data refers to the speed at which the data is generated and must be processed in order to derive value from it. This can range from real-time streaming data to batch processing of large datasets. Finally, the variety of big data refers to the different types and formats of data that are being generated, such as text, images, videos, and structured and unstructured data.

Overall, the size of big data is constantly growing as more and more data is being generated every day, and it presents both challenges and opportunities for businesses and organizations that are looking to extract insights and value from this data.

For reference, the data lake for LLM training can be 454 TiB or more. Most individuals and small companies do not work with data at this scale.

  1. Common Crawl, the datalake needed to train your AI. https://www.commoncrawl.org/

Why is it important? What are the big implications?

Big data is important for several reasons:

  1. Better decision making: Big data can provide valuable insights that can help organizations make better decisions. By analyzing large and complex datasets, organizations can identify patterns, trends, and correlations that may not be apparent using traditional data analysis techniques.
  2. Improved efficiency: Big data can help organizations optimize their operations and improve efficiency by identifying bottlenecks, predicting maintenance needs, and optimizing resource allocation.
  3. New business models: Big data can enable organizations to create new business models and revenue streams. For example, companies like Uber and Airbnb use big data to match supply and demand and create new opportunities in the sharing economy.
  4. Personalization: Big data can help organizations personalize their products and services based on individual customer preferences and behavior.

The implications of big data are significant and far-reaching. Some of the key implications include:

  1. Privacy concerns: The collection and analysis of large amounts of personal data raise privacy concerns. Organizations must be transparent about their data collection practices and take steps to protect individuals’ privacy.
  2. Skills gap: Analyzing and interpreting big data requires specialized skills in areas like data science, machine learning, and statistics. There is a growing skills gap in these areas, which may limit organizations’ ability to fully capitalize on the potential of big data.
  3. Regulation: Governments around the world are implementing regulations to protect individuals’ privacy and ensure that organizations are using data ethically and responsibly. Organizations must comply with these regulations or face penalties.
  4. Digital divide: The ability to analyze and interpret big data requires access to technology and high-speed internet, which may create a digital divide between those who have access to these resources and those who do not.

Of course, data sizes grow over time: gigabytes or terabytes were considered massive 20 years ago. *joke* A company called Teradata, started in 1979, now sounds petty, or just as common as my 1TB iPhone. But with this growth in mind, Big Data can still be roughly defined like this: if you still think you can host it yourself, you are probably not using it, unless you are in the FAAMG gang. (Read this article from Facebook in 2009, describing how it had to invent its own solution for storing photos. If you don’t have this kind of problem, you are probably not that big.)

That wasn’t aimed at mocking the many startups out there in the big data space. When they aim to serve many business clients, they are certainly much bigger than typical B2C startups.

It’s just that most people talk about big data without being able to put some numbers to it. Most solutions will not have billions of users or billions of transactions per day. It’s not that we are not in an exponential age, but most people can’t even handle linear growth properly whilst bs-ting about big data.
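To put some numbers to it, here is a rough back-of-envelope sketch. The one billion events per day and the 1,000 bytes per event are our own illustrative assumptions, not measurements from any real system, and `daily_load` is just a helper name we made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_load(events_per_day: int, bytes_per_event: int) -> dict:
    """Turn a daily event count into a per-second rate and storage figures."""
    bytes_per_day = events_per_day * bytes_per_event
    return {
        "events_per_second": events_per_day / SECONDS_PER_DAY,
        "terabytes_per_day": bytes_per_day / 1000**4,
        "petabytes_per_year": bytes_per_day * 365 / 1000**5,
    }


# Assumed workload: one billion events a day, about 1 KB each.
load = daily_load(events_per_day=1_000_000_000, bytes_per_event=1_000)
for name, value in load.items():
    print(f"{name}: {value:,.3f}")
```

Under these assumptions, a billion events a day works out to roughly 11,600 events per second and about 1 TB of new data per day. If your own numbers are several orders of magnitude below that, a single well-tuned database is probably still enough.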

How did TikTok grow so quickly? Partly because technologies and methods for dealing with a billion or so users were already available for free, thanks to the previous generations of top tech players driving such innovations.

We used to think about data mainly as structured data with tables and rows in the SQL days. But computers have become so much more powerful than before that we can now process unstructured data efficiently too. The rise of NoSQL helped to drive big data even further.

While the above description gives a general glimpse of what big data is, how big is big data? Is there a set number? Well, there is no set answer. But generally, traditional data is measured in megabytes, gigabytes and terabytes, while big data is measured in petabytes up to zettabytes. Generally speaking (unless you work for such a big data company), if it is something that you can still consider hosting in the office or at home, it isn’t really big data, and it is certainly not growing fast enough.

Let’s also learn a little about the different words for big data sizes:

  • 1 Bit = Binary Digit
  • 8 Bits = 1 Byte
  • 1000 Bytes = 1 Kilobyte
  • 1000 Kilobytes = 1 Megabyte
  • 1000 Megabytes = 1 Gigabyte
  • 1000 Gigabytes = 1 Terabyte
  • 1000 Terabytes = 1 Petabyte
  • 1000 Petabytes = 1 Exabyte
  • 1000 Exabytes = 1 Zettabyte
  • 1000 Zettabytes = 1 Yottabyte
  • 1000 Yottabytes = 1 Brontobyte
  • 1000 Brontobytes = 1 Geopbyte

(Brontobyte and Geopbyte are informal names, not official SI prefixes.)
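As a small sketch of how these units relate, the helper below formats a byte count using the decimal (power-of-1000) units listed above, and converts the 454 TiB figure mentioned earlier, which uses binary (power-of-1024) units, into decimal terabytes. The function names are our own, and we stop at yottabytes since the larger names are not standardised.

```python
DECIMAL_UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_size(num_bytes: float) -> str:
    """Format a byte count using decimal (power-of-1000) units."""
    value = float(num_bytes)
    for unit in DECIMAL_UNITS:
        # Stop once the value fits in this unit, or when we run out of units.
        if value < 1000 or unit == DECIMAL_UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1000


def tib_to_bytes(tib: float) -> int:
    """Convert tebibytes (power-of-1024 units) to bytes."""
    return int(tib * 1024**4)


print(human_size(1000**5))            # 1.00 PB
print(human_size(tib_to_bytes(454)))  # 499.18 TB
```

Note the gap between TiB and TB: 454 TiB is close to 500 TB in decimal units, which matters when you compare a dataset size against disk vendor specifications.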

There are words in Chinese that represent even larger units:

  • 兆 (zhào): 10^12
  • 京 (jīng): 10^16
  • 垓 (gāi): 10^20
  • 秭 (zǐ): 10^24
  • 穰 (ráng): 10^28
  • 沟 (gōu): 10^32
  • 涧 (jiàn): 10^36
  • 无量 (wúliàng): 10^68
  • 大数 (dàshù): 10^72

We can see that those units are really big compared to the units we are more familiar with, like MB, GB and TB. And the data growth in the past 20–30 years has been really substantial. Moore’s Law stated that “the number of transistors that can be packed into a given unit of space will double about every two years.” Today, however, the doubling of installed transistors on silicon chips occurs at a pace faster than every two years. A similar law applies to data capacity: Kryder’s law predicted that the doubling of disk density on one inch of magnetic storage would take place once every thirteen months. Here is the graph from statista.com, showing the increase in data volume since 2010:

Reference: https://www.statista.com/statistics/871513/worldwide-data-created/

As you can see from the graph, the increase in data volume is significant. And the use of that data has changed our lives in many ways. Big data is important because it provides vast opportunities for businesses. It can be used independently or with traditional data as the basis of advanced analytics, and it helps businesses make sound decisions. Some common techniques include data mining, text analytics, predictive analytics, data visualization, AI, machine learning, statistics and natural language processing. With big data analytics, you can ultimately fuel better and faster decision-making, modelling and prediction of future outcomes, and enhanced business intelligence.

Big data usage that impacts our everyday life

Here are some examples of big data usage that impact our everyday lives:

  1. Online advertising: Companies like Google and Facebook use big data to deliver personalized advertisements based on users’ browsing history, search history, and other online activity. This is why you may see ads for products or services that are relevant to your interests or needs.
  2. Health care: Big data is being used in healthcare to improve patient outcomes, reduce costs, and identify new treatments. For example, wearable devices and health apps can collect data on patients’ physical activity, sleep patterns, and vital signs, which can be used to monitor their health and identify potential health issues before they become more serious.
  3. Transportation: Big data is being used in transportation to improve traffic flow, reduce congestion, and improve safety. For example, traffic sensors and cameras can collect data on traffic patterns, which can be used to optimize traffic signals and reduce wait times at intersections.
  4. E-commerce: Online retailers like Amazon use big data to analyze customer behavior and provide personalized product recommendations. This is why you may see product recommendations on Amazon that are similar to products you have previously purchased or viewed.
  5. Energy management: Big data is being used in energy management to optimize energy consumption and reduce costs. For example, smart meters can collect data on energy usage, which can be used to identify areas where energy consumption can be reduced and to predict energy demand.

What are the common software or tools for big data?

There are several software and tools that are commonly used for processing, analyzing, and visualizing big data.

  1. Hadoop: Hadoop is an open-source framework for distributed storage and processing of large datasets. It provides a scalable and fault-tolerant platform for processing big data using the MapReduce programming model.
  2. Spark: Apache Spark is an open-source framework for distributed data processing that can keep intermediate data in memory, which often makes it faster than Hadoop MapReduce. It provides an easy-to-use API for processing large datasets and supports various programming languages, including Java, Scala, and Python.
  3. NoSQL databases: NoSQL databases are designed to handle large volumes of unstructured or semi-structured data. Examples include MongoDB, Cassandra, and Couchbase.
  4. Data warehouses: Data warehouses are used for storing and analyzing structured data. Examples include Amazon Redshift, Google BigQuery, and Snowflake.
  5. Visualization tools: Visualization tools are used to create visual representations of data that are easier to understand and interpret. Examples include Tableau, QlikView, and Microsoft Power BI.
  6. Machine learning frameworks: Machine learning frameworks are used for building and training models that can be used for predictive analytics and other applications. Examples include TensorFlow, Keras, and scikit-learn.
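To give a feel for the MapReduce programming model mentioned under Hadoop, here is a minimal single-machine sketch in plain Python. It does not use the Hadoop API at all. The function names (`map_phase`, `shuffle`, `reduce_phase`) are our own, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: List[str]) -> Dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data is big", "data is everywhere"]
print(word_count(docs))  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The point of splitting the work this way is that each map call only needs one document and each reduce call only needs one key, so both steps can be spread across as many machines as you have.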

Horizontal and Vertical scaling

Horizontal scaling (also known as scaling out) involves adding more nodes or machines to a distributed system, such as a cluster, to increase its processing power and storage capacity. This is achieved by distributing the workload across multiple machines, each of which performs a subset of the overall tasks. In other words, the system scales horizontally by adding more machines that work in parallel.

In general, horizontal scaling is often preferred in big data scenarios where the workload is highly parallelizable and can be easily divided into smaller subtasks that can be processed independently across multiple machines. Most BASE transactions are suitable for this type of scaling. This approach allows for greater scalability, fault tolerance, and cost-effectiveness, as additional machines can be added as needed to handle increasing workloads.
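One common way to divide work across machines is to assign each record to a node by hashing its key. The sketch below is a simplified illustration with made-up user IDs and a hypothetical `shard_for` helper, not how any particular database does it.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_nodes: int) -> int:
    """Pick a node index for a key using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# Hypothetical keys: ten thousand user IDs spread over 4 and then 8 nodes.
user_ids = [f"user-{i}" for i in range(10_000)]
for num_nodes in (4, 8):
    load = Counter(shard_for(uid, num_nodes) for uid in user_ids)
    print(num_nodes, "nodes ->", sorted(load.items()))
```

Each node ends up with a roughly equal share of the keys, and doubling the node count roughly halves the load per node. One caveat: with plain modulo hashing, most keys move to a different node when the node count changes, which is why production systems often use consistent hashing or similar schemes instead.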

Vertical scaling (also known as scaling up) involves adding more resources, such as CPUs, RAM, or storage capacity, to a single machine to increase its processing power and storage capacity. This is achieved by upgrading the hardware components of the machine to handle more data and perform more operations.

Vertical scaling is typically used in scenarios where the workload is not easily parallelizable and requires a single, powerful machine to handle it. Workloads with strong requirements on ACID transactions often fall into this category. Examples of this include databases that require large amounts of RAM or processing power to handle complex queries.

However, it’s worth noting that modern big data architectures often combine both horizontal and vertical scaling approaches to achieve optimal performance and scalability. For example, a distributed data processing system like Apache Spark may use horizontal scaling to distribute tasks across multiple nodes, while also vertically scaling individual nodes to handle larger workloads.

Quick note on ACID and BASE

ACID is an acronym for Atomicity, Consistency, Isolation, and Durability. These properties are considered essential for ensuring data integrity in traditional transactional databases.

  • Atomicity ensures that a transaction is treated as a single, indivisible unit of work. Either all the changes made by the transaction are committed or none of them are, so that the database remains consistent.
  • Consistency ensures that the database always remains in a valid state, even in the presence of failures or errors.
  • Isolation ensures that concurrent transactions do not interfere with each other, by providing a level of isolation between them.
  • Durability ensures that once a transaction is committed, the changes made to the database are permanent and will survive any subsequent failures or crashes.
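Atomicity is easiest to see with a failed transfer. The sketch below uses Python’s built-in `sqlite3` module with an in-memory database. The `accounts` table, the names and the balances are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


def balances(conn):
    rows = conn.execute("SELECT name, balance FROM accounts ORDER BY name")
    return dict(rows.fetchall())


print(transfer(conn, "alice", "bob", 30), balances(conn))
print(transfer(conn, "alice", "bob", 500), balances(conn))
```

In the second transfer, Bob’s account is credited first, then Alice’s debit violates the constraint. Because both statements belong to one transaction, the credit is rolled back too, and the balances stay at 70 and 80.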

On the other hand, BASE stands for Basically Available, Soft-state, Eventually consistent. BASE properties are often used in NoSQL databases that are designed for scalability and availability, and may not require strict ACID guarantees.

  • Basically Available means that the database should always be available for reads and writes, even in the presence of failures or network partitions.
  • Soft-state means that the state of the system may change over time, even without input. This means that the system can tolerate inconsistencies in the data.
  • Eventually Consistent means that the system will eventually converge to a consistent state, but not necessarily immediately. This means that changes to the database may not be immediately visible to all nodes in a distributed system.
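Eventual consistency can be sketched with two toy replicas that accept writes independently and sync up later. This is a deliberately simplified model: the `Replica` class is our own, the timestamps are supplied by hand, and conflicts are resolved with a simple last-write-wins rule. Real systems use more careful mechanisms.

```python
class Replica:
    """A toy key-value replica that resolves conflicts by last-write-wins."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.store.get(key)
        # Keep the incoming write only if it is newer than what we hold.
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        """Pull every entry from another replica and merge it locally."""
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], timestamp=1)         # a client writes to replica a
b.write("cart", ["book", "pen"], timestamp=2)  # a later write lands on b

print("before sync:", a.read("cart"), b.read("cart"))
a.sync_from(b)
b.sync_from(a)
print("after sync: ", a.read("cart"), b.read("cart"))
```

Before the sync, the two replicas give different answers for the same key, which is the temporary inconsistency BASE systems accept. After both have exchanged their data, they converge on the newest value.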

In summary, ACID is a set of properties that ensures data consistency and integrity in traditional transactional databases, while BASE is a set of properties that ensures availability and scalability in NoSQL databases designed for distributed systems.

Limitations and other considerations for Big Data Projects

Here are some of the most common limitations of big data:

  1. Data quality: Big data is only useful if the data being analyzed is accurate, relevant, and reliable. Poor data quality can lead to inaccurate insights and flawed decision-making.
  2. Data privacy and security: With the increasing amount of data being collected, there are concerns around privacy and security. Organizations must ensure that the data they collect is secure and only used for its intended purposes.
  3. Technical challenges: The volume, variety, and velocity of big data can present technical challenges such as processing speed, storage, and data integration.
  4. Cost: Collecting, storing, and analyzing large amounts of data can be expensive. Organizations must balance the cost of data management with the potential benefits of using big data.
  5. Ethical considerations: Big data raises ethical questions around the use of personal data and the potential for algorithmic bias and discrimination.
  6. Interpretation and actionability: Even with high-quality data, insights gained from big data can be difficult to interpret and act upon. Data scientists and analysts must translate complex data into actionable insights for decision-makers.

It’s important to consider these limitations when designing and implementing a big data strategy. By addressing these challenges, organizations can unlock the full potential of big data and make informed decisions that drive business success.

How to become a good Data Scientist?

To become a good data scientist in the big data era, here are some tips:

  1. Develop a strong foundation in statistics, mathematics, and computer science: Data science requires a solid understanding of statistics, mathematics, and computer science. Take courses or pursue self-study in these areas to build a strong foundation.
  2. Learn programming languages and software tools used in data science: Familiarize yourself with programming languages like Python, R, SQL, and tools like Hadoop, Spark, and NoSQL databases. These tools are commonly used in data science for data cleaning, manipulation, analysis, and visualization.
  3. Gain hands-on experience with real-world datasets: Work on real-world data science projects to gain hands-on experience. You can find datasets online or work on projects in your field of interest.
  4. Stay up-to-date with the latest developments in data science: Attend conferences, read research papers, and stay updated with the latest tools and techniques used in data science.
  5. Develop domain-specific knowledge: Gain domain-specific knowledge in areas like healthcare, finance, or marketing, to better understand the data and develop insights and solutions that are relevant to the industry.
  6. Collaborate with other data scientists: Collaborate with other data scientists to learn from their experiences, gain new insights, and build a network of professionals in the field.
  7. Develop strong communication skills: Data scientists need to be able to communicate their findings and insights to non-technical stakeholders. Develop strong communication skills to be able to effectively present your findings and insights to a wide range of audiences.

Overall, becoming a good data scientist requires a combination of technical skills, domain-specific knowledge, and strong communication and collaboration skills. Continuous learning and staying up-to-date with the latest developments in the field are also important for success in the big data era.



Written by Edward Tsang

Experienced technologist, focused on selective combinations of blockchain and AI. I use Medium to repost what I share on https://evolvingviews.com.
