Customers constantly face challenges when lifting and shifting data across platforms. Although the data is present in reports (for example, Google Search Console reports, billing reports, and web traffic logs), getting value out of it is not straightforward. Complex ETL operations are needed, and they are both resource- and cost-intensive. The complexity grows with the size and diversity of the data.
Codenatives' cloud-native technology teams can deliver cost-effective, industry-leading, and scalable solutions for data of any complexity.
A recent case study involved acquiring data from Google Search Console, which remains a challenge for many companies in the SEO industry. Although Google Search Console provides a way to obtain reports, it does not offer a configurable interface for custom analysis of a property's search data. Many organizations resort to complex methods such as running nightly scripts on dedicated high-end servers to extract and transform the search data. Maintaining these scripts is a resource-intensive proposition, and while moving to cloud infrastructure may reduce running costs, it does not ease the burden of maintaining the platform that runs them.
With the Codenatives solution, the customer saved tens of thousands of dollars in re-engineering costs. That is not all: they also save on operating costs every month, thanks to the efficient scaling process that was set up.
Codenatives developed a cost-effective data pipeline workflow for our client using Google Cloud Composer, a managed workflow orchestration service on Google Cloud Platform built on the open-source tool Apache Airflow. On the managed Apache Airflow instance, we created a Directed Acyclic Graph (DAG): a set of configurable tasks, composed from built-in operators, that accomplishes this seamlessly.
For this requirement, the PythonOperator, BashOperator, and BigQueryOperator were used to build a pipeline that robustly performs ETL from Google Search Console to BigQuery. The data extracted from Search Console was transformed, backed up to Cloud Storage, and stored in different BigQuery schemas according to the requirement specifications.
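To illustrate the extract-and-transform step, here is a minimal sketch in plain Python. The dimension list, field names, and defaults follow the general shape of the Search Analytics API's request and response, not the client's actual schema:

```python
# Sketch of the Search Console extract/transform helpers; dimensions
# and field names are assumptions, not the client's real schema.
DIMENSIONS = ["date", "query", "page"]


def build_query_body(start_date, end_date, start_row=0, row_limit=25000):
    """Build a request body for the Search Analytics query endpoint."""
    return {
        "startDate": start_date,
        "endDate": end_date,
        "dimensions": DIMENSIONS,
        "startRow": start_row,
        "rowLimit": row_limit,
    }


def flatten_rows(rows):
    """Flatten API rows ({'keys': [...], metrics...}) into BigQuery-ready dicts."""
    records = []
    for row in rows:
        # Each row's 'keys' list lines up positionally with DIMENSIONS.
        record = dict(zip(DIMENSIONS, row.get("keys", [])))
        record.update({
            "clicks": row.get("clicks", 0),
            "impressions": row.get("impressions", 0),
            "ctr": row.get("ctr", 0.0),
            "position": row.get("position", 0.0),
        })
        records.append(record)
    return records
```

Inside the pipeline, a function like `flatten_rows` would be wrapped in a PythonOperator, with the resulting records handed off to the load steps.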
Auditing and monitoring the Composer runs is another critical aspect of the ETL process. Configuring Cloud Logging with triggers and alert thresholds ensures that critical issues do not go unnoticed. The other vital elements of the solution were:
- Promoting code reuse by retrofitting code from the client's previous version into Airflow DAGs.
- Enabling parallel runs on the DAGs, which retrieve data much faster than the prior version of the code.
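The logging-with-alerts setup described above can be sketched as a Cloud Monitoring alert policy. The fragment below assumes a user-defined logs-based metric (here called `airflow_task_failures`) has been created over the Composer environment's error logs; the metric name, project, and thresholds are all hypothetical:

```yaml
# Hypothetical alert policy: fire when any Airflow task-failure log
# entries are counted in a five-minute window. Names are illustrative.
displayName: "Composer task failures"
combiner: OR
conditions:
  - displayName: "Airflow task failure count above zero"
    conditionThreshold:
      filter: >-
        metric.type="logging.googleapis.com/user/airflow_task_failures"
        AND resource.type="cloud_composer_environment"
      comparison: COMPARISON_GT
      thresholdValue: 0
      duration: "0s"
      aggregations:
        - alignmentPeriod: "300s"
          perSeriesAligner: ALIGN_SUM
notificationChannels:
  - projects/example-project/notificationChannels/CHANNEL_ID
```

A policy like this routes failures to email, Slack, or PagerDuty channels as soon as they appear, which is what keeps critical issues from going unnoticed.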
The customer benefited from this approach because the entire re-engineering was done cost-effectively. Organizations typically spend tens of thousands of dollars managing data ingestion and ETL from diverse sources, and the rigidity of infrastructure, lack of control over performance, and clunky monitoring and logging approaches add to the complexity. Running this process on Cloud Composer ensures much lower operating costs for the customer.