These phrases are everywhere nowadays. Traditional BI systems are being pushed to the next level by a massive volume of data: big data.
Since then, data is no longer thought of as "simply a vast volume to cope with" but as critical business information. This volume and, more importantly, this variety of data has a significant influence on the many businesses that can benefit from it. Because traditional BI tools cannot handle the volume and variety of the data, big data provides more opportunities to dig deeper, generate new KPIs, recommend new customer experiences, forecast customer and business behaviors, and so on.
At SQLI, we quickly realized that this new technology would be the future of data processing for our customers: in fact, when the first big data tools were released, some of our customers asked us to build specific use cases with them.
SQLI has a variety of expert profiles that can deal with a wide range of technologies and architectures:
- on-premises (bare-metal solutions),
- public cloud (AWS, Google Cloud, Azure - Gold Partner),
- private cloud (local providers),
- hybrid (a mix of public cloud, private cloud, and on-premises).
We take a customer-centric approach, and we always recommend the best option for them based on their use cases, issues, and requirements.
Our Big Data capabilities can be divided into the following categories based on the profiles we have in our Swiss team:
- Data governance
- Data security
- Data architecture
- Data ingestion
- Data visualization
- Data science
For our customers, governance has become one of the most difficult topics to deal with: "How can we handle tons of unstructured/structured data across multiple devices?", "What's the value of this data?", "What does this data mean?", "Is it the single source of truth, or does the same data exist elsewhere in the company with another meaning or another way to calculate it?"
For years, data was collected in the big data stack without a real governance solution; Apache Atlas, a data catalog and data lineage tool for Hadoop ecosystems, debuted in 2015. Previously, once data was pushed into a BI system, it was governed "manually" by teams using separate documents or traditional methods.
Our customers can now take advantage of cloud computing power through embedded solutions:
- Google Cloud Platform (GCP) with Collibra,
- AWS with Collibra, Zaloni, or BigID,
- Azure with Azure Purview.
SQLI gives its experience in this area based on customer expectations and requirements.
Since the GDPR came into force in 2018, security has become a hot topic. Identity and Access Management (IAM), Authentication & Authorization (A&A), OAuth2, and TLS 1.2 are all terms you have probably heard over the past few years. Every component of a data project, and especially a data platform, must include a security layer.
In the Big Data stack, Apache Ranger (2014) is a data authorization and audit solution for Hadoop ecosystems. It can be used in conjunction with Kerberos and Active Directory services.
Data security isn't only about tools: data engineers may (and should) use pseudonymization and anonymization methods, which are required by the GDPR.
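The GDPR distinguishes pseudonymization (identifiers replaced, but still re-linkable via a separate key) from anonymization (no way back to the person). As a minimal sketch of that difference, assuming hypothetical field names and a keyed-hash approach (not SQLI's actual implementation):

```python
import hashlib
import hmac

# Hypothetical key; in practice this would live in a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed hash: reversible only by
    whoever holds the key and a mapping table, hence pseudonymous."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def anonymize(record: dict) -> dict:
    """Drop direct identifiers and generalize quasi-identifiers so the
    record can no longer be linked back to a person."""
    out = {k: v for k, v in record.items() if k not in ("name", "email")}
    out["age"] = (record["age"] // 10) * 10  # generalize exact age to a decade
    return out

record = {"name": "Alice", "email": "alice@example.com", "age": 34, "city": "Geneva"}
pseudo = {**record,
          "name": pseudonymize(record["name"]),
          "email": pseudonymize(record["email"])}
anon = anonymize(record)
```

The pseudonymized record keeps its analytical value (the same person always hashes to the same token), while the anonymized one is safe to share more widely.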
We can also incorporate API solutions (such as Axway API Management) that provide control and authorization for users to access specific data in other systems.
Our customers may, of course, find the same tools on public cloud solutions:
- Azure: API Management, Application Gateway, Active Directory, etc.
- AWS: Identity and Access Management (IAM), API Gateway, etc.
From a broad perspective, we believe that data architecture, and more broadly big data architecture, cannot be reduced to a "Hadoop or not" choice. Each customer's needs and each business case must be explored and studied in order to identify the optimal solution.
In the early days (2000-2010), some clients chose to switch their data platforms to Hadoop (bare-metal) platforms to lessen their reliance on high-cost licensing from database and tool vendors. The initial driver for migrating all data from traditional databases to Hadoop was the return on investment. Over time, however, some of them uncovered drawbacks:
- Hadoop is unable to meet all of their requirements, particularly in terms of visualization, due to the lack of connectors.
- Hadoop, in its early incarnation (2000-2010), was more of a batch/micro-batch processing tool for massive datasets than a real-time processing tool.
- Hive, the Hadoop ecosystem's database, is slow in comparison to traditional databases.
- Customers must deal with infrastructure, OS, and software layer issues/migrations, which makes maintenance difficult.
Since 2015, we've noticed a shift in how our customers deploy Big Data solutions:
- Several tools have arisen, and several real-time tools are now part of the Hadoop ecosystem (Kafka, Spark Streaming, Flink, etc.).
- Open-source or licensed databases are now used only for visualization purposes.
- Customers create dedicated teams to manage the Big Data platform (infrastructure, OS, and software included).
Customers that choose to implement such solutions in the public cloud tend to follow a common pattern:
- The majority of them go for Azure HDInsight, Azure Databricks, AWS Athena, AWS Redshift, Snowflake, or other SaaS/PaaS technologies to build their data platforms.
- They take advantage of the ease and speed of deployment, scaling, and billing.
Customers moving from on-premises to public cloud or hybrid solutions should keep in mind that some specific requirements necessitate a cloud-based component, while others still necessitate an on-premises one.
In each case, we assist our customers in defining their strategy and achieving their goals.
Data ingestion is the "submerged part of the iceberg": it is a critical component of every data solution, yet it is often undervalued! We at SQLI are firm believers that data ingestion is the foundation of a relevant data platform. In our view, flows are divided into two types:
- Hot Layer: this layer is for all "hot" data that must be published in (near) real time. We use dedicated tools for this purpose, which allow us to acquire, process, and analyze data as quickly as possible.
- Cold Layer: this layer is for all processing done in batch mode, as in classic BI systems. It feeds the dashboards and reports that show historical data up to the previous day.
Within these layers, we build and deploy business and technical rules as required by our customers: data preparation, data filtering, data pseudonymization, data anonymization, and so on.
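The steps above can be sketched as a toy pipeline (the record fields and rules are hypothetical, not a customer implementation): preparation and filtering run first, then each record is routed to the hot or cold layer:

```python
from datetime import datetime, timezone

def prepare(record: dict) -> dict:
    """Data preparation: normalize fields and stamp the ingestion time."""
    record = dict(record)
    record["ingested_at"] = datetime.now(timezone.utc).isoformat()
    record["country"] = record.get("country", "").upper()
    return record

def keep(record: dict) -> bool:
    """Data filtering: drop records that fail basic quality rules."""
    return bool(record.get("event_id")) and record.get("amount", 0) >= 0

def route(record: dict) -> str:
    """Hot layer for events that must be visible in (near) real time,
    cold layer for everything handled by the nightly batch."""
    return "hot" if record.get("priority") == "realtime" else "cold"

layers = {"hot": [], "cold": []}
for raw in [{"event_id": "e1", "amount": 10, "priority": "realtime", "country": "ch"},
            {"event_id": "e2", "amount": 5, "country": "fr"},
            {"amount": -3}]:  # fails the quality filter: no event_id, negative amount
    rec = prepare(raw)
    if keep(rec):
        layers[route(rec)].append(rec)
```

In a real deployment these rules would run inside the streaming or batch engine (Spark, Flink, etc.) rather than a Python loop, but the layering logic is the same.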
During the 2000s and early 2010s, the majority of flows were created with Pig and Hive (both using the MapReduce and MRv2 frameworks). Since 2015, a slew of new tools has emerged, including:
- On "Hadoop"/Apache platforms:
- Spark/Spark Streaming,
- Hive LLAP,
- On public cloud platforms:
- Azure: StreamSets, Storm, Spark/Spark Streaming, Azure Stream Analytics
- AWS: StreamSets, Spark/Spark Streaming, AWS Data Pipeline, AWS Glue
- GCP: Cloud Dataflow, BigQuery
Of course, on each platform these solutions can be paired with event technologies such as Kafka, Azure Event Hubs, AWS IoT Events, and so on.
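What these event technologies buy you is decoupling: producers append events to a log, and each consumer (the real-time dashboard, the nightly batch) reads at its own pace with its own offset. A deliberately simplified in-memory stand-in for a broker illustrates the pattern (real Kafka adds partitions, persistence, and replication):

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory stand-in for an event broker (Kafka, Event Hubs, ...):
    producers append to a topic log; each consumer group tracks its own offset."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> ordered event log
        self.offsets = defaultdict(int)   # (topic, group) -> next offset to read

    def produce(self, topic: str, event: dict) -> None:
        self.topics[topic].append(event)

    def consume(self, topic: str, group: str, max_events: int = 10) -> list:
        start = self.offsets[(topic, group)]
        batch = self.topics[topic][start:start + max_events]
        self.offsets[(topic, group)] += len(batch)  # commit the group's offset
        return batch

broker = MiniBroker()
broker.produce("clicks", {"user": "u1", "page": "/home"})
broker.produce("clicks", {"user": "u2", "page": "/cart"})

hot = broker.consume("clicks", group="realtime-dashboard")
cold = broker.consume("clicks", group="nightly-batch")  # independent offset
```

Because each group commits its own offset, the hot and cold layers can read the same events without interfering with one another.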
Dashboarding and reporting are what our customers anticipate and see first. This is, without a doubt, the most visible part of a data project (Big Data or not). We are convinced that a good visualization does not require an explanation; if it does, then we have not implemented it correctly.
Since Big Data entered the picture, things have shifted a little. We can now use an indexer such as Elasticsearch or Solr to deploy real-time dashboards and reports, and plug in visualization applications like Power BI, Tableau, and so on.
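As an illustration of the plumbing behind such real-time dashboards, the sketch below (the index name and event fields are made up) builds the NDJSON body that Elasticsearch's `_bulk` API expects for indexing a batch of events:

```python
import json

def to_bulk_payload(index: str, events: list) -> str:
    """Build an Elasticsearch _bulk request body (NDJSON): an action line
    followed by the document source, one pair per event, with a trailing
    newline as the API requires."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"

payload = to_bulk_payload("sales-realtime", [
    {"sku": "A-100", "amount": 42.5, "ts": "2021-06-01T10:00:00Z"},
    {"sku": "B-200", "amount": 13.0, "ts": "2021-06-01T10:00:02Z"},
])
# POST this payload to <cluster>/_bulk with Content-Type: application/x-ndjson;
# the visualization tool then queries the 'sales-realtime' index.
```

Bulk indexing keeps the hot layer fast: one HTTP round trip per batch instead of one per event.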
We can also utilize standard databases or "big data" databases for different purposes (Hive, Kudu, Snowflake, AWS Redshift, etc.).
[Link to the dedicated BI page]
[Link to the dedicated visualization page]
Our SQLI Data team will be happy to assist you with your next data project and share their knowledge in all of these areas.