Big Data

"Data is new oil of 21st century"

 

"By 2025, it is estimated that there will be more than to 21 billion IoT devices"

 

"41.5 billion IoT devices will be generating 79.4 zettabytes of data in 2025"

Statements like these are very common nowadays. The huge amount of data being produced pushes traditional BI systems to the next level: big data.

Data is no longer considered merely "a big volume to deal with" but "crucial information for business". This volume and, above all, this variety of data have a great impact on many companies, which can leverage real benefits from them. Where common BI tools fall short on volume and variety, big data makes it possible to analyze things more deeply: it can produce new KPIs, suggest new customer experiences, predict customer and business behaviors, and so on.

At SQLI, we understood very early that this technology would be the future of data processing for our customers: in fact, some of them asked us to implement use cases when big data tools had only just been released.

SQLI can rely on different expert profiles who can address a wide variety of technologies and architectures:

  • on-premises (bare metal solutions),
  • public cloud (Azure (Gold Partner), AWS, Google solutions),
  • private cloud (local providers),
  • hybrid (a mix of public and private cloud or on-premises).

Our approach is customer-centric: we always suggest the best option for our customers, focusing on their use cases, challenges, and requirements.

Regarding the profiles we have in our Swiss team, we can divide our Big Data capabilities as follows:

  • Data governance
  • Data security
  • Data architecture
  • Data ingestion
  • Data visualization
  • Data science

SQLI Capabilities

Data Governance

OUR VISION

Governance has become one of the toughest subjects for our customers: “How can we handle tons of unstructured and structured data across multiple devices?”, “What is the value of this data?”, “What does this data mean?”, “Is it the unique source of truth, or do we have the same data elsewhere, with another meaning or way to calculate it, across our company?”

Inside the big data stack, data was massively collected without a real governance tool for years. Apache Atlas, started in 2015, appeared as a data catalog and data lineage tool for Hadoop ecosystems.

Before that, data was governed "manually" by teams, through side documentation or with traditional tools once the data had been pushed into a BI system.

Now, our customers can take advantage of cloud power with embedded solutions such as:

  • Google Cloud Platform (GCP) with Collibra,
  • AWS with Collibra, Zaloni, BigID,
  • Azure with Azure Purview.

Guided by our customers’ expectations and requirements, SQLI provides its expertise on this matter.

Data Security

OUR VISION

Security has been a major topic since GDPR came into force in 2018. Identity and Access Management (IAM), authentication and authorization, OAuth2, and TLS 1.2 are keywords you have probably heard many times over the last couple of years. Every part of a data project, and even more so of a data platform, must include a security component.

TOOLING

In the Big Data stack, Ranger (2014) appeared as a data authorization and audit tool for Hadoop ecosystems. It can be coupled with Kerberos or other Active Directory services for authentication.

Data security is not only covered by tools: data engineers can, and must, implement the pseudonymization and anonymization processes that are mandatory for GDPR compliance.
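To illustrate the difference between these two processes, here is a minimal Python sketch (the field names and the hard-coded key are hypothetical; in production the key would come from a vault or KMS): pseudonymization replaces an identifier with a stable keyed token that can still serve as a join key, while anonymization removes or generalizes identifiers irreversibly.

```python
import hmac
import hashlib

# Hypothetical secret: in production, fetch it from a vault/KMS.
SECRET_KEY = b"replace-with-a-vault-managed-key"

def pseudonymize(value: str) -> str:
    """Replace an identifier with a stable keyed hash: the same input
    always yields the same token, so it can still be used as a join key."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def anonymize(record: dict) -> dict:
    """Drop direct identifiers and generalize quasi-identifiers so the
    record can no longer be traced back to a person."""
    result = dict(record)
    result.pop("email", None)        # suppress direct identifiers
    result.pop("full_name", None)
    if "birth_year" in result:       # generalize a quasi-identifier
        result["birth_year"] = (result["birth_year"] // 10) * 10
    return result

customer = {"full_name": "Jane Doe", "email": "jane@example.com", "birth_year": 1987}
print(pseudonymize(customer["email"]))  # stable token, safe to store
print(anonymize(customer))              # {'birth_year': 1980}
```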

Furthermore, we can add API tools (e.g. Axway API Management) which include control and authorization for users to access specific data in other systems.

Of course, our customers can find the same tools in public cloud solutions:

  • Azure: API Management, Application Gateway, Azure Active Directory, etc.
  • AWS: Identity and Access Management (IAM), API Gateway, etc.

Data Architecture

OUR VISION

From a global point of view, we think that data architecture, and more widely big data architecture, cannot be reduced to a “Hadoop or not” question. Every need and every business case from our customers must be identified and analyzed to find the best solution.

In the beginning (2000-2010), some customers decided to migrate their data platforms to Hadoop (bare metal) platforms to reduce their dependency on expensive licenses from database and tool vendors.

ROI was the first reason to migrate all data from classical databases to Hadoop. Over time, some of them discovered several pain points:

  • Hadoop cannot answer all their needs, especially on the visualization side, because of connectors;
  • In its initial form (2000-2010), Hadoop was more of a batch/micro-batch processing tool for large datasets than a real-time processing tool;
  • The Hadoop ecosystem's database (Hive) is slow compared to traditional databases;
  • Maintenance is painful because customers have to deal with infrastructure, OS, and software layer issues and migrations.

Since 2015, we have seen some changes in the way our customers implement Big Data solutions:

  • For customers who want or must keep an on-premises solution:
    • Several tools have emerged, and some real-time tools are now included in the Hadoop ecosystem (Kafka, Spark Streaming, Flink, etc.).
    • Open-source or licensed databases are used only for visualization needs.
    • They build a specific team that oversees the Big Data platform (infrastructure, OS, and software included).

  • For customers who decide to deploy such solutions on a public cloud:
    • Most of them build their data platforms around services like Azure HDInsight, Azure Databricks, AWS Athena, AWS Redshift, Snowflake, or other SaaS/PaaS solutions.
    • They take advantage of the ease and speed of deployment, scaling, and billing.

  • For customers who decide to move from on-premises to public cloud, or from public cloud to on-premises:
    • Specific requirements drive the move toward a cloud or an on-premises solution.

In each scenario, we help our customers define their strategy and meet their expectations.

Data Ingestion

OUR VISION

Data ingestion is the “submerged part of the iceberg”: it is an essential part of any data solution, yet it is always underrated and underestimated!

At SQLI, we hold the strong conviction that data ingestion is the central brick of a “relevant data platform”. From our point of view, we split flows into two categories:

  • Hot Layer: we use this layer for all “hot” data which needs to be published in “real time”. For that purpose, we use specific tools which let us ingest, process, and analyze data as quickly as possible.
  • Cold Layer: we use this layer for all processing in “batch mode”, as in traditional BI systems. It serves all dashboards and reports which present historical data up to the previous day.

Inside these layers, we develop and deploy the business and technical rules required by our customers. In a few words, we are talking about data preparation, data filtering, data pseudonymization, data anonymization, etc.
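As an illustration, here is a minimal PySpark sketch of the two layers (the broker address, topic, paths, and column names are assumptions, not a real customer setup): the hot layer ingests a Kafka topic with Structured Streaming, while the cold layer runs a classic batch aggregation over the previous day's data.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ingestion-layers").getOrCreate()

# Hot layer: ingest events from a (hypothetical) Kafka topic in near real time.
hot = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
       .option("subscribe", "events")                     # assumed topic
       .load()
       .selectExpr("CAST(value AS STRING) AS payload"))

hot_query = (hot.writeStream
             .format("parquet")
             .option("path", "/data/hot/events")               # assumed path
             .option("checkpointLocation", "/data/chk/events")
             .start())  # in a real job: hot_query.awaitTermination()

# Cold layer: a batch aggregation over yesterday's data, as in traditional BI.
cold = (spark.read.parquet("/data/raw/sales")                  # assumed dataset
        .where(F.col("event_date") == F.date_sub(F.current_date(), 1))
        .groupBy("country")
        .agg(F.sum("amount").alias("daily_revenue")))

cold.write.mode("overwrite").parquet("/data/cold/daily_revenue")
```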

TOOLING

During the 2000-2010 period, most flows were developed using Pig and Hive (both relying on the MapReduce and MRv2 frameworks). Since 2015, many new tools have appeared:

  • On "Hadoop"/Apache platforms:
    • Spark/Spark Streaming,
    • Hive LLAP,
    • Flink,
    • Delta,
    • NiFi,
    • Storm,
    • StreamSets

  • On public cloud platforms:
    • Azure: StreamSets, Storm, Spark/Spark Streaming, Azure Stream Analytics
    • AWS: StreamSets, Spark/Spark Streaming, AWS Data Pipeline, AWS Glue
    • GCP: Cloud Dataflow, BigQuery

Each platform can, of course, couple these solutions with event tooling such as Kafka, Azure Event Hubs, AWS IoT Events, etc.
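As a simple illustration of that coupling, here is a minimal producer sketch using the kafka-python client (the broker address and topic name are assumptions); managed services such as Azure Event Hubs expose a Kafka-compatible endpoint, so only the connection settings would change.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker and topic; a managed broker would only change
# the bootstrap servers and the security settings.
producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("iot-measurements", {"device_id": "sensor-42", "temp_c": 21.5})
producer.flush()  # make sure the message leaves the client buffer
```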

Data Visualization

OUR VISION

Dashboards and reports are always the first thing our customers see and expect. This is, of course, the most visible part of a data project (Big Data or not). We have the strong conviction that a visualization should not need an explanation next to it; if it does, then we did not implement it the right way.

TOOLING

Things have changed a bit since Big Data came into the game. Now we can deploy real-time dashboards and reports by plugging visualization tools such as Power BI or Tableau into an indexer solution such as Elasticsearch or Solr.

For other needs, we can use traditional databases or “big data” ones (Hive, Kudu, Snowflake, AWS Redshift, etc.).
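For the real-time path, here is a minimal sketch using the official Python Elasticsearch client (the cluster URL, index name, and document shape are assumptions): once measurements are indexed, a dashboard can simply query the index continuously.

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch  # pip install elasticsearch

# Hypothetical local cluster; a managed deployment would only change
# the URL and credentials.
es = Elasticsearch("http://localhost:9200")

doc = {
    "device_id": "sensor-42",
    "temp_c": 21.5,
    "@timestamp": datetime.now(timezone.utc).isoformat(),
}

# Index the measurement; the real-time dashboard queries the "metrics" index.
es.index(index="metrics", document=doc)
```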


Data Science

Across all these subjects, our SQLI Data team will be pleased to help you build your next data projects and to share its expertise.


Contact us to find out more
