Pipeline Development:
- Build and maintain data pipelines using Azure Databricks and Azure Data Factory.
- Implement ingestion and transformation logic across Bronze and Silver layers.
- Support batch and incremental processing patterns.
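The incremental pattern above is typically driven by a persisted high-watermark: each run picks up only rows modified since the last run, then advances the watermark. A minimal sketch in plain Python (in a real pipeline the filter would be a PySpark/ADF query predicate, and the watermark would live in a metadata table; the row shape and `modified_at` field here are illustrative assumptions):

```python
from typing import Dict, List, Tuple

def incremental_batch(rows: List[Dict], last_watermark: int) -> Tuple[List[Dict], int]:
    """Select only rows modified after the stored watermark, and
    return the new watermark to persist for the next run."""
    new_rows = [r for r in rows if r["modified_at"] > last_watermark]
    # If nothing new arrived, keep the old watermark unchanged.
    new_watermark = max((r["modified_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

# Hypothetical source rows; modified_at is an epoch-style timestamp.
source = [
    {"id": 1, "modified_at": 100},
    {"id": 2, "modified_at": 205},
    {"id": 3, "modified_at": 310},
]

batch, wm = incremental_batch(source, last_watermark=200)
# Only ids 2 and 3 pass the watermark filter; the next run starts from 310.
```

The same shape works for full batch loads by setting the watermark to zero.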
Curated Layer Logic:
- Implement hydration, merge, and upsert logic using Delta Lake.
- Ensure curated datasets meet data quality and business requirements.
- Handle late-arriving data and incremental updates.
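In Delta Lake the merge/upsert above is expressed with `MERGE INTO` (or `DeltaTable.merge` in PySpark), usually with a timestamp condition so late-arriving data cannot overwrite fresher state. The semantics can be sketched in plain Python (the `id`/`event_ts` field names are illustrative assumptions, not a fixed schema):

```python
from typing import Dict, List

def upsert(target: Dict[int, Dict], incoming: List[Dict]) -> Dict[int, Dict]:
    """Merge incoming records into target, keyed by business key.
    An incoming record overwrites an existing row only when its
    event timestamp is newer, so a late-arriving (older) record
    cannot clobber fresher state; unmatched keys are inserted."""
    for row in incoming:
        current = target.get(row["id"])
        if current is None or row["event_ts"] > current["event_ts"]:
            target[row["id"]] = row
    return target

# Hypothetical current state and incoming batch.
target = {1: {"id": 1, "event_ts": 50, "status": "open"}}
incoming = [
    {"id": 1, "event_ts": 40, "status": "stale"},  # late-arriving, older: ignored
    {"id": 2, "event_ts": 60, "status": "new"},    # no match: inserted
]
state = upsert(target, incoming)
```

The timestamp guard corresponds to a `whenMatchedUpdate` condition such as `source.event_ts > target.event_ts` in the Delta merge builder.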
Performance & Storage Optimization:
- Optimize Delta Lake tables for performance and cost.
- Select and tune appropriate storage formats (Parquet, Delta).
- Apply partitioning, compaction, and file sizing strategies.
- Tune Spark jobs for large-scale data processing.
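The compaction and file-sizing work above boils down to rewriting many small files into fewer files near a target size (what `OPTIMIZE` or a repartition-and-rewrite does). A minimal sketch of the sizing arithmetic, assuming a 128 MB target as an illustrative default:

```python
import math
from typing import List

def compaction_target_files(file_sizes_bytes: List[int], target_mb: int = 128) -> int:
    """Given the current file sizes in a table or partition, compute
    how many output files a compaction rewrite should produce so that
    each file lands near the target size."""
    total = sum(file_sizes_bytes)
    target = target_mb * 1024 * 1024
    # At least one output file, rounding up so no file exceeds the target much.
    return max(1, math.ceil(total / target))

# 1000 small 1 MB files compact down to 8 files at a 128 MB target.
n = compaction_target_files([1024 * 1024] * 1000, target_mb=128)
```

The resulting count is what you would pass to `repartition(n)` before rewriting, or rely on `OPTIMIZE` to achieve automatically.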
Downstream & DWH Collaboration:
- Work closely with DWH and BI teams to support downstream consumption.
- Provide optimized datasets for Synapse and reporting workloads.
- Support data validation and reconciliation with Gold layer outputs.
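Reconciliation against Gold outputs usually means comparing row counts and summed measures between layers. A minimal sketch, assuming a single numeric measure column named `amount` (illustrative; real checks would run as SQL against both layers):

```python
from typing import Dict, List

def reconcile(silver_rows: List[Dict], gold_rows: List[Dict],
              measure: str = "amount") -> Dict:
    """Compare row counts and a summed measure between two layers
    and return a small pass/fail report."""
    report = {
        "count_silver": len(silver_rows),
        "count_gold": len(gold_rows),
        "sum_silver": sum(r[measure] for r in silver_rows),
        "sum_gold": sum(r[measure] for r in gold_rows),
    }
    report["counts_match"] = report["count_silver"] == report["count_gold"]
    report["sums_match"] = report["sum_silver"] == report["sum_gold"]
    return report

# Hypothetical layer extracts: counts match, sums do not.
silver = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
gold = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
report = reconcile(silver, gold)
```

A failed check like this would trigger the validation/troubleshooting work described elsewhere in the role.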
Engineering Best Practices:
- Implement basic CI/CD practices for data pipelines.
- Follow coding standards, documentation, and version control practices.
- Support production troubleshooting and performance tuning.
Experience:
- 6–8 years of experience in data engineering.
- Strong hands-on experience building pipelines on Azure.
- Experience working with large datasets and distributed processing.
Technical Skills:
- Strong proficiency in PySpark.
- Hands-on experience with Azure Databricks.
- Strong experience with Azure Data Factory.
- Deep knowledge of Delta Lake tuning and optimization.
- Experience with storage optimization (Parquet, Delta).
- Strong SQL skills for transformation and validation.
Tools & Practices:
- Experience with Git and basic CI/CD pipelines.
- Familiarity with data quality and validation techniques.
- Experience working in Agile delivery models.
Soft Skills:
- Strong analytical and problem-solving skills.
- Ability to work independently on complex pipelines.
- Good communication and collaboration skills.
Nice to Have:
- Experience supporting Synapse Dedicated SQL Pool.
- Exposure to streaming or near real-time pipelines.
- Familiarity with data governance or metadata tools.