
Building an End-to-End Machine Learning Pipeline for Airfoil Noise Prediction

// · 5 min read

One of the projects I’m most proud of from my data engineering work is an end-to-end machine learning pipeline built with PySpark, using the NASA Airfoil Self-Noise dataset. The goal was to predict the sound pressure level generated by airfoil sections from aerodynamic and acoustic test parameters. Here’s how I built it and what I learned.

The Dataset

The NASA Airfoil Self-Noise dataset comes from aerodynamic and acoustic testing of airfoil blade sections in a wind tunnel. It contains 1503 observations with five input features: frequency (Hz), angle of attack (degrees), chord length (meters), free-stream velocity (m/s), and suction side displacement thickness (meters). The target variable is the scaled sound pressure level in decibels.

It’s a clean, well-structured dataset, which made it ideal for focusing on the pipeline architecture rather than spending all my time on data cleaning. Sometimes you want to wrestle with messy data. Other times you want to learn how to build robust ML infrastructure. This was the latter.

ETL: CSV to Parquet

The first stage was straightforward but important. I loaded the raw CSV into a Spark DataFrame, validated the schema, checked for nulls and outliers, and wrote the cleaned data to Parquet format.

Why Parquet? Columnar storage with built-in compression. For a dataset with five numeric features that would be read repeatedly during model training and evaluation, Parquet’s columnar format means Spark only reads the columns it needs for any given operation. The compression ratio was significant compared to the raw CSV, and subsequent reads were noticeably faster.

This ETL step also included renaming columns to remove spaces and special characters, casting all features to DoubleType for consistency, and adding a monotonically increasing ID column for traceability. Small details, but they prevent headaches later in the pipeline.

Feature Engineering

PySpark’s ML library expects features as a single vector column, so the next step was assembling the input features using VectorAssembler. I combined all five input columns into a single features vector.

After assembly, I applied StandardScaler to normalize the features. Scaling matters for linear models because features with larger numeric ranges can dominate the optimization. Frequency values range from 200 Hz to 20,000 Hz, while chord length ranges from 0.0254 m to 0.3048 m. Without scaling, the model would weight frequency disproportionately simply because of its magnitude.

I used a 70/30 train/test split with a fixed seed for reproducibility. Reproducibility is one of those things that feels unnecessary until you’re trying to debug why your model’s performance changed and you can’t tell if it’s the code or the random split.

Model Training

I used LinearRegression as the primary model. Not because it’s the most powerful option, but because it’s interpretable, fast to train, and provides a solid baseline. In my experience, you should always start with a simple model and only add complexity when the baseline falls short.

The pipeline was constructed as a PySpark Pipeline object, chaining VectorAssembler, StandardScaler, and LinearRegression into a single unit. This is important because it ensures the same transformations are applied consistently during both training and prediction. I’ve seen bugs in production ML systems where the training preprocessing differed from inference preprocessing, causing subtle accuracy degradation that’s incredibly hard to diagnose.

Evaluation

I evaluated the model using three metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), and R squared (R2).

MSE penalizes large errors more heavily, which is useful when you care about avoiding big misses. MAE gives you the average magnitude of errors in the original units (decibels in this case), which is more intuitive. R2 tells you what proportion of the variance in sound pressure level is explained by the model.
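In PySpark these come from RegressionEvaluator (metricName set to "mse", "mae", or "r2"), but the arithmetic behind them is worth seeing once. A toy example with made-up numbers, not the model’s actual predictions:

```python
# Illustrative predictions only -- not results from the real model.
y_true = [126.2, 125.2, 119.5, 117.1]
y_pred = [125.8, 125.9, 120.1, 116.4]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(e * e for e in errors) / n    # squaring penalizes big misses
mae = sum(abs(e) for e in errors) / n   # average miss, in decibels

y_mean = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - y_mean) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot                # fraction of variance explained
```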

The results were reasonable for a linear model on this dataset. The R2 indicated the five aerodynamic features capture a meaningful portion of the noise variation, though non-linear models would likely do better. The MAE gave a concrete sense of prediction accuracy that I could communicate to someone without a statistics background: “the model predicts sound pressure level to within X decibels on average.”

Model Persistence

The final step was saving the trained pipeline for reuse. Fitting a PySpark Pipeline produces a PipelineModel, and PipelineModel.save() writes the entire pipeline, including fitted transformers and the trained model, to disk. Loading it back with PipelineModel.load() produces an identical pipeline that can generate predictions without retraining.

I also saved the evaluation metrics alongside the model, so anyone loading a saved pipeline can immediately see how it performed on the test set. This is a habit I picked up from working in environments where models get handed off between teams. If the evaluation results aren’t attached to the model artifact, they inevitably get lost.
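The metrics-alongside-model habit is simple to sketch. The directory layout and the "metrics.json" name are my conventions, not a PySpark requirement; the fitted pipeline itself would sit next to it, written via model.write().overwrite().save(model_dir) and read back with PipelineModel.load(model_dir):

```python
import json
import os
import tempfile

model_dir = os.path.join(tempfile.mkdtemp(), "airfoil_model")
os.makedirs(model_dir, exist_ok=True)

# Placeholder values -- in the real pipeline these come from the
# evaluator run on the held-out test set.
metrics = {"mse": 0.0, "mae": 0.0, "r2": 0.0, "seed": 42, "split": [0.7, 0.3]}

with open(os.path.join(model_dir, "metrics.json"), "w") as f:
    json.dump(metrics, f, indent=2)

# Anyone picking up the artifact can read the metrics without retraining.
with open(os.path.join(model_dir, "metrics.json")) as f:
    loaded = json.load(f)
```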

What I Learned

Building this pipeline reinforced a few principles I now apply to all my data work. First, invest in the infrastructure (ETL, feature engineering, evaluation) before obsessing over the model. A well-built pipeline with a simple model beats a complex model duct-taped together. Second, reproducibility isn’t optional. Fix your seeds, version your data, and save your artifacts. Third, PySpark’s Pipeline abstraction is genuinely excellent for structuring ML workflows. It forces you to think about your transformations as a sequence of composable stages, which makes the whole system easier to test, debug, and extend.

The NASA Airfoil Self-Noise dataset is publicly available if you want to try this yourself. It’s a great learning dataset because it’s small enough to iterate quickly but complex enough that your pipeline architecture actually matters.
