Pivot data in Dataform with EXECUTE IMMEDIATE statements

EDIT: I've figured out how I could do this using operations within Dataform. Article here.

Hi everyone,

I have a data asset that needs to be computed every day (hencewhy I'm using dataform - it's a great tool), but I've come across a limitation. The computation requires pivoting data. The dataset looks like this:

InstitutionYearResult
Glasgow20225
Glasgow20231
Aberdeen20236
Aberdeen20226

I need to pivot it such that it looks like this

Institution2022 Result2023 Result
Glasgow51
Aberdeen66

It's easy enough to do in the BigQuery editor using dynamic pivot statements as this blog outlines. Super useful.

The problem is that I can't run EXECUTE IMMEDIATE FORMAT() statements in Dataform. I tried to dynamically reference my tables wrapped within a pivot() function I wrote:

 

 

let pivot = (
    schemaRef,
    tableName,
    aggregatedColumn,
    pivotColumn
) => {

    let refString = `
        ref({
            schema: ${schemaRef},
            name: "${tableName}"
        })`
return `EXECUTE IMMEDIATE FORMAT("""
  SELECT * FROM ${refString}
  PIVOT (SUM(${aggregatedColumn}) as ${aggregatedColumn} FOR ${pivotColumn} IN %s);
""", (SELECT CONCAT("(", STRING_AGG(DISTINCT CONCAT("'", ${pivotColumn}, "'"), ','), ")") FROM ${refString}))
`
};

module.exports = {
    pivot
};

 

 

 And with the SQLX looking like this:

 

 

config {
    type: "table",
    description: "Each record represents a response within typeform.",
    schema: dataform.projectConfig.vars.typeform_intermediate_schema,
    tags: ["typeform_mart"]
}

${
      pivot.pivot(schemaRef = schema,
          tableName = "response_counts",
          aggregatedColumn = "response_count",
          pivotColumn = "form_year")
  }

 

 

But the issue is that the ref() string can only be used within an SQLX context. Big issue, because I need this to run after the tables prior "response_counts" have materialised as a part of the data pipeline.

Any help would be appreciated!

 

1 3 185
3 REPLIES 3

The core issue lies in the way Dataform handles dynamic SQL execution. While EXECUTE IMMEDIATE is excellent for flexible SQL statements, Dataform primarily prioritizes predeclaration of tables. This means you cannot dynamically build an entire table structure using EXECUTE IMMEDIATE FORMAT().

To address these challenges, there are several strategies that can be explored, each with its considerations for balancing flexibility, performance, and maintainability:

Predefined Pivots for Predictable Columns

  • Approach: Create SQLX blocks for each expected pivot column (e.g., years).
  • Consideration: Integrate with CI/CD pipelines for automating updates when new pivot values are introduced, reducing manual overhead.

Conditional Logic for Data Integrity

  • Approach: Use assertions and structured dependencies to manage data quality and presence.
  • Consideration: Leverage the Dataform API for dynamic project interactions, potentially automating checks or updates.

Custom Stored Procedure for Centralized Logic

  • Approach: Encapsulate pivot logic in a BigQuery stored procedure, called from Dataform.
  • Consideration: Incorporate "IF EXISTS" logic within procedures for more dynamic behavior, and explore advanced SQL techniques that mimic pivoting capabilities.

Hybrid External Scripting for Maximum Flexibility

  • Approach: Use external scripts (e.g., Python) to dynamically generate and execute pivot SQL.
  • Consideration: Ensure scripts are well-integrated into your project's ecosystem, possibly through modular design principles, to maintain coherence and ease future updates.

Additional Enhancements

  • Advanced SQL Techniques: Beyond stored procedures, explore SQL constructs like CASE statements or creative query designs that offer pivot-like functionality within SQLX's constraints.
  • Community Engagement: Actively participate in the Dataform community. Sharing insights and challenges can lead to innovative solutions and influence the platform's development direction.
  • Documentation and Knowledge Sharing: Maintain comprehensive documentation, especially for complex or unconventional solutions. This ensures continuity and scalability of your data operations.
  • Stay Informed and Modular: Keep abreast of Dataform and Google Cloud Platform updates. Design your projects with modularity in mind to easily adapt to new features and best practices.
While Dataform's current model presents challenges for implementing dynamic SQL operations like pivoting, the strategies outlined offer pathways to achieve desired outcomes. By creatively combining Dataform's features with external tools, advanced SQL techniques, and a proactive approach to project management and community engagement, you can navigate these limitations effectively. Future-proofing your projects with these considerations ensures that your data pipelines remain robust, adaptable, and ahead of the curve.

Hey, not to flame, but your response seems very GPT-like. Let's deconstruct your suggestions:

"Predefined Pivots for Predictable Columns"

This suggestion isn't sustainable due to the dynamic nature of one of the columns. The pivot needs to be dynamic. It's an analytics-ready table that relies on the columns spanning all distinct values of a column. Creating CI/CD overhead to achieve this isn't feasible, as we have SQL engineers who can debug SQL but not CICD engineers who can debug CICD pipelines.

 "Conditional Logic for Data Integrity"

This solution is a non-solution that doesn't address the question.

"Custom Stored Procedure for Centralized Logic"

This could potentially be a solution, but it relies on business logic being encoded outside of dataform. This isn't scalable and maintainable, as when the logic breaks (all logic eventually breaks), engineers won't be able to find where the error is. I don't want to take on tech debt when I know that the solution can be achieved in Dataform.

Hybrid External Scripting for Maximum Flexibility

Again, totally misses the question. 

Honestly, I'm quite disappointed with this response. I thought that forums were meant to be spaces where people can exchange ideas and come to novel solutions. I asked whether this solution was possible within Dataform and if anyone had come against this problem before. I don't feel that you addressed the question, and I feel like the GPT-like response (I can't confirm whether it is or isn't, but I'm 99% sure) totally devalues the reason for posting in a community forum. 

I'd rather get no response than an LLM automated one. Thanks.

Chiming here. I understand many answers would help but unfortunately dataform remains unpopular after 2 years being taking over, and besides ms4443 reply, you won't find (m)any more.

I find that AI-assisted - ms4443 is still reviewing & editing the response - are usually quite useful, especially if there is no other feedback.

Back to your question, dataform only handles declarative substitutions ie. at compilation time. Your only option as frustrating as it is, are writing procedures. It's something we lost coming from airflow but pipelining sql's are easier 

We all hope dataform will evolve but honestly there is little indication and no communication about any.