Support the ongoing development of Laravel.io →
Article Hero Image

Query content of S3 objects with SQL using Laravel

31 Mar, 2022 6 min read

Photo by Pero Kalimero on Unsplash

Recently, I was working on a Laravel application that deals with a lot of CSV data. Different business partners were uploading a new version of the CSV files to S3 every hour.

My application was processing these files to build various charts for reporting. It also provides an interface for authorized users to view the content of uploaded CSV files. Users can also filter and search different records from the interface. Most of the time application only needs to access the most recent rows.

My initial approach was to download the entire CSV file from s3 into memory on every request. Then loop through each row to calculate the required data for charts. I was caching the calculated results for half an hour to improve the performance. This approach had a few drawbacks:

  • When the cache expires, I had to download the entire file from the s3. Some files were over 5MB in size. Downloading such large files decreases the performance significantly.
  • In S3, You have to pay when you retrieve data. Downloading large files frequently can increase the cost.
  • Users of my application wanted to see recent records. Most of the time they were viewing the old cached data.
  • Writing a code that filters the CSV content was not as pleasant as writing a query in SQL.

Another approach was to get the content of CSV every 15 minutes. Then insert these records into the SQL database. I did not like this approach because now the data exists in two different places. This approach was not scalable either. As the number of records in the CSV file increase, I have to update or insert more records into the database. On the bright side, I can use Eloquent or Database query builder to access the data.

AWS S3 Select

While Amazon S3 is awesome for storage, It also has a feature called S3 Select. With S3 Select, You can use a simple SQL query to filter the content of the stored objects. and retrieve only a subset of data that you need.

S3 Select currently works on objects that store data in CSV, JSON, and Apache Parquet formats. It also works with GZIP or BZIP2 compressed CSV and JSON objects.

Example of S3 Select Query with Laravel

For example purposes, I am going to use the following sample data. You can download the complete dataset here

Sample dataset

Laravel filesystem does not support QueryObjectContent operation out of the box. I created the following macro so I can reuse the same logic in many places.

<?php

namespace App\Providers;

use Illuminate\Filesystem\AwsS3V3Adapter;
use Illuminate\Support\ServiceProvider;

class MacroServiceProvider extends ServiceProvider
{
    public function register(): void
    {
        AwsS3V3Adapter::macro('query', function (string $key, string $expression, array $overrides = []) {
            /** @var AwsS3V3Adapter $this */
            $default = [
                'Bucket' => $this->config['bucket'],
                'Expression' => $expression,
                'ExpressionType' => 'SQL',
                'InputSerialization' => [
                    'CSV' => [
                        'FileHeaderInfo' => 'USE',
                        'FieldDelimiter' => ','
                    ],
                ],
                'Key' => $key,
                'OutputSerialization' => [
                    'CSV' => [
                        'RecordDelimiter' => ','
                    ],
                ],
                'RequestProgress' => [
                    'Enabled' => false,
                ],
            ];

            return $this->getClient()->selectObjectContent(array_merge($default, $overrides));
        });
    }
}

This maco uses the underlying S3Client to execute the QueryObjectContent command. It accepts the name of the object, a query, and an optional parameter for the QueryObjectContent command. and It returns Aws\Result back.

Select & Limit Clause

S3 Select uses simple SQL statements. The following code snippet retrieves the first five records from our example dataset.

$records = collect();

// S3 Query to select first 5 records
$query = 'SELECT "date", "canadian_dollar_to_usd" FROM s3object LIMIT 5';

// Execute the query and fetch the results
$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);

// Loop through payload events to collect result
foreach($result['Payload'] as $event) {
    if (isset($event['Records'])) {
        $records = $records->merge(
            Str::of($event['Records']['Payload'])
                ->explode(PHP_EOL)
                ->filter()
                ->map(fn(string $item) => Str::of($item)->explode(','))
        );
    }
}

dd($records);

If your CSV file does not have headers, You can also specify index numbers in your query. You may have to update the FileHeaderInfo parameter value. The possible values are USE, IGNORE, or NONE. If your CSV does not include headers use NONE otherwise use IGNORE.

$query = 'SELECT _1, _2 FROM s3object LIMIT 5';

Like SQL statement, You can also use * to retrieve all the columns.

$query = "SELECT * FROM s3Object LIMIT 5"

Where Clause

To explore where clause, Let's get the list of the CAD to the USD conversion rate since March 1, 2022. Notice that the date is a reserved keyword in S3 select. To use these reserved keywords, you have to wrap them in double quotes. You can find the list of reserved keywords here.

// ...

$query = 'SELECT "date", canadian_dollar_to_usd FROM s3object WHERE CAST("date" AS TIMESTAMP) >= CAST(\'2022-03-01\' AS TIMESTAMP)';

$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);

// ...

Using Functions

Like SQL, S3 select also allows us to use functions. You can find a complete list of functions in AWS documentation. In the following code, we will get the total number of records in a file using the COUNT aggregate function.

$count = null;

$query = 'SELECT COUNT(*) FROM s3object';

$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);

// Loop through payload events to get records
foreach($result['Payload'] as $event) {
    if (isset($event['Records'])) {
        $count = Str::of($event['Records']['Payload']);
    }
}

dd($count);

Cost of S3 Select

Amazon S3 Select is very cheap. The cost of S3 Select depends on the number of select requests, data scanned and amount of data returned. Let's say I have 50GB of data stored in S3. I make 100,000 Select requests per month and return 20GB of data. It will cost me less than 2 US dollars. Following is the pricing estimate from the AWS pricing calculator as of March 2022.

Tiered price for: 50 GB
50 GB x 0.0230000000 USD = 1.15 USD
Total tier cost = 1.1500 USD (S3 Standard storage cost)
100,000 GET requests in a month x 0.0000004 USD per request = 0.04 USD (S3 Standard GET requests cost)
20 GB x 0.0007 USD = 0.014 USD (S3 select returned cost)
30 GB x 0.002 USD = 0.06 USD (S3 select scanned cost)
1.15 USD + 0.04 USD + 0.014 USD + 0.06 USD = 1.26 USD (Total S3 Standard Storage, data requests, S3 select cost)
S3 Standard cost (monthly): 1.26 USD

Conclusion

In conclusion, You can reduce data transfer time and cost by using s3 select. Although S3 only supports simple queries, It can be very quiet useful in some scenarios.

Last updated 3 weeks ago.

driesvints, justsanjit, nadu, ngoquocdat, phcostabh liked this article

5
Like this article? Let the author know and give them a clap!
justsanjit (Sanjit Singh) Experienced full-stack developer with 6+ years specializing in Laravel, React, and Vue.js. Currently seeking new job opportunities. Let's chat!

Other articles you might like

Article Hero Image November 18th 2024

Laravel Custom Query Builders Over Scopes

Hello 👋 Alright, let's talk about Query Scopes. They're awesome, they make queries much easier to r...

Read article
Article Hero Image November 19th 2024

Access Laravel before and after running Pest tests

How to access the Laravel ecosystem by simulating the beforeAll and afterAll methods in a Pest test....

Read article
Article Hero Image November 11th 2024

🍣 Sushi — Your Eloquent model driver for other data sources

In Laravel projects, we usually store data in databases, create tables, and run migrations. But not...

Read article

We'd like to thank these amazing companies for supporting us

Your logo here?

Laravel.io

The Laravel portal for problem solving, knowledge sharing and community building.

© 2024 Laravel.io - All rights reserved.