GitHub features | The Getting Started Guide to Big Data: BigData-Notes


2022-08-30

This issue is about big data. If you want to learn big data, this one is for you!


Big data processing process


Learning framework

Log collection frameworks: Flume, Logstash, and Filebeat

Distributed file storage system: Hadoop HDFS

Database systems: MongoDB and HBase

Distributed computing frameworks:

  • Batch processing framework: Hadoop MapReduce
  • Stream processing framework: Storm
  • Hybrid processing frameworks: Spark, Flink

Query analysis frameworks: Hive, Spark SQL, Flink SQL, Pig, Phoenix

Cluster resource manager: Hadoop YARN

Distributed coordination service: ZooKeeper

Data migration tool: Sqoop

Task scheduling frameworks: Azkaban, Oozie

Cluster deployment and monitoring: Ambari, Cloudera Manager

data collection

The first step in big data processing is data collection. Medium and large projects today usually adopt a microservice architecture with distributed deployment, so data has to be collected from multiple servers, and the collection process must not disrupt normal business operations. A variety of log collection tools have grown out of this requirement, such as Flume, Logstash, and Filebeat, which can handle complex data collection and aggregation with only simple configuration.
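As a sketch of how little configuration such a tool needs, here is a minimal Flume agent that tails a local log file into HDFS. The agent name `a1`, the component names, and all paths are illustrative assumptions, not part of any real deployment:

```properties
# Minimal Flume agent sketch: tail a local log file and write it to HDFS.
# Names (a1, r1, c1, k1) and paths are illustrative only.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow new lines appended to the application log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: roll files into HDFS, partitioned by day
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

The source/channel/sink split is the core Flume idea: each piece can be swapped (e.g. a file channel for durability, a Kafka sink instead of HDFS) without touching the others.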

data storage

Once the data is collected, the next question is: how should it be stored? The most familiar options are traditional relational databases such as MySQL and Oracle, whose strengths are fast storage of structured data and support for random access. Big data, however, is usually semi-structured (such as log data) or even unstructured (such as video and audio). To store massive amounts of semi-structured and unstructured data, distributed file systems such as Hadoop HDFS, KFS, and GFS were developed. They can store structured, semi-structured, and unstructured data alike, and they scale out simply by adding machines.
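The core idea, splitting a file into fixed-size blocks and spreading replicated copies across machines, can be sketched in a few lines of plain Python. The block size and node names below are toy values for illustration (HDFS defaults are 128 MB blocks and 3 replicas); this is a sketch of the concept, not HDFS itself:

```python
# Toy illustration of how a distributed file system such as HDFS stores
# a file: split it into fixed-size blocks, then place each block on
# several machines (replication) for fault tolerance.
from itertools import cycle

BLOCK_SIZE = 4          # bytes per block (HDFS default is 128 MB)
REPLICATION = 2         # copies of each block (HDFS default is 3)
NODES = ["node1", "node2", "node3"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the byte stream into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Round-robin each block's replicas onto distinct nodes."""
    ring = cycle(range(len(nodes)))
    placement = {}
    for idx, _ in enumerate(blocks):
        start = next(ring)
        placement[idx] = [nodes[(start + r) % len(nodes)]
                          for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data!")
print(len(blocks))               # 4 blocks of up to 4 bytes
print(place_blocks(blocks)[0])   # ['node1', 'node2']
```

Adding a machine just extends `NODES`; nothing about the file layout has to change, which is exactly why such systems scale horizontally.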

A distributed file system solves the problem of mass data storage, but a good storage system has to consider access as well as storage. For example, you may want random access to individual records, which is exactly what traditional relational databases are good at and distributed file systems are not. So is there a storage scheme that combines the advantages of a distributed file system with those of a relational database? Out of this demand came HBase and MongoDB.
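A rough sketch of the idea behind an HBase-style store is rows kept sorted by key, so that both random gets and ordered range scans are cheap. The class and method names below are made up for illustration; they are not a real client API:

```python
# Toy HBase-like table: rows sorted by row key, supporting both
# random access (get) and ordered range scans (scan).
import bisect

class ToyTable:
    def __init__(self):
        self._keys = []    # row keys, kept sorted
        self._rows = {}    # row key -> column dict

    def put(self, row_key: str, columns: dict):
        """Insert or update a row, keeping the key index sorted."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key: str):
        """Random access by exact row key."""
        return self._rows.get(row_key)

    def scan(self, start: str, stop: str):
        """All rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

t = ToyTable()
t.put("user#003", {"name": "Carol"})
t.put("user#001", {"name": "Alice"})
t.put("user#002", {"name": "Bob"})
print(t.get("user#002"))                               # {'name': 'Bob'}
print([k for k, _ in t.scan("user#001", "user#003")])  # ['user#001', 'user#002']
```

Real HBase gets durability by keeping these sorted structures as files on HDFS, which is how it layers random access on top of a distributed file system.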

data analysis

The most important part of big data processing is data analysis, which is usually divided into two types: batch processing and stream processing.

  • Batch processing: processing a bounded set of offline data, accumulated over a period of time, all at once. Corresponding frameworks include Hadoop MapReduce, Spark, and Flink.
  • Stream processing: processing data in motion, that is, processing it as it arrives. Corresponding frameworks include Storm, Spark Streaming, and Flink Streaming.

Batch processing and stream processing each have their own applicable scenarios. When results are not time-sensitive or hardware resources are limited, batch processing is a good fit; when timeliness matters, stream processing is the better choice. As server hardware gets cheaper and the demand for timely results grows, stream processing is becoming more and more common, for example in stock price forecasting and e-commerce operations analysis.
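The difference can be shown with a running average: the batch version needs the whole data set up front, while the stream version updates its answer with each arriving element. Plain Python, purely illustrative:

```python
# Batch vs. stream processing of the same computation: an average.

def batch_average(values):
    """Batch: all data must be available before any result is produced."""
    return sum(values) / len(values)

def stream_average(stream):
    """Stream: emit an updated average as each element arrives."""
    total, count = 0.0, 0
    for v in stream:
        total += v
        count += 1
        yield total / count   # a result is available immediately

data = [10, 20, 30, 40]
print(batch_average(data))         # 25.0
print(list(stream_average(data)))  # [10.0, 15.0, 20.0, 25.0]
```

The final answers agree, but the stream version has a usable result after the very first element, which is the whole appeal when timeliness matters.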

The frameworks above all require programming for data analysis. Does that mean you cannot analyze data unless you are a backend engineer? Of course not: big data is a very complete ecosystem, and where there is a demand there is a solution. Query analysis frameworks emerged so that anyone familiar with SQL can analyze data; commonly used ones are Hive, Spark SQL, Flink SQL, Pig, and Phoenix. They allow flexible query analysis using standard SQL or SQL-like syntax: the SQL is parsed, optimized, and converted into job programs. For example, Hive converts SQL into MapReduce jobs, Spark SQL converts SQL into a series of RDDs and transformations, and Phoenix converts SQL queries into one or more HBase scans.
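To make the "SQL becomes a job program" idea concrete, here is a toy rendering of a GROUP BY count as the map and reduce steps an engine like Hive might generate. The query, table name, and data are invented for illustration:

```python
# The query  SELECT word, COUNT(*) FROM words GROUP BY word
# expressed as the map/reduce job a SQL-on-Hadoop engine might emit.
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, 1) pair for every input record."""
    for word in records:
        yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the values for each key -- the COUNT(*)."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

words = ["spark", "hive", "spark", "flink", "spark"]
print(reduce_phase(map_phase(words)))  # {'spark': 3, 'hive': 1, 'flink': 1}
```

The GROUP BY key becomes the map output key, and the aggregate function becomes the reduce step; more complex queries compile into chains of such jobs.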

data application

Once the analysis is complete, the next step is data application, and that depends entirely on your actual business needs. For example, you can visualize the data, or use it to improve a recommendation algorithm, now widely deployed in short-video personalized recommendation, e-commerce product recommendation, and news feed recommendation. You can also use the data to train machine learning models; that belongs to other fields with their own frameworks and technology stacks, so I won't go into detail here.
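As one small taste of the recommendation use case, item-to-item similarity can be computed from user ratings with nothing but cosine similarity. The items, users, and ratings below are toy data; real recommender systems are far more elaborate:

```python
# Toy item-based recommendation: suggest the item whose rating vector
# is most similar (by cosine) to one the user already likes.
import math

ratings = {   # item -> ratings given by users [u1, u2, u3]
    "item_a": [5, 3, 0],
    "item_b": [4, 3, 1],
    "item_c": [0, 1, 5],
}

def cosine(u, v):
    """Cosine similarity of two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(item, table):
    """The other item whose ratings best match `item`'s."""
    base = table[item]
    others = [(other, cosine(base, vec))
              for other, vec in table.items() if other != item]
    return max(others, key=lambda pair: pair[1])[0]

print(most_similar("item_a", ratings))  # 'item_b'
```

Users who rated `item_a` highly also rated `item_b` highly, so their vectors point in nearly the same direction; `item_c`'s ratings come from a different audience, so its similarity is low.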


Picture reference:
https://www.edureka.co/blog/hadoop-ecosystem



Ictcoder Free Source Code GitHub features |, the Getting started guide to big data, BigData-Notes https://ictcoder.com/github-features-the-getting-started-guide-to-big-data-bigdata-notes/
