How to read S3 CSV files into hashmaps using OpenCSV
Now that large amounts of data are the norm, that data is very frequently stored in S3 in CSV format for consumption through serverless query layers such as Athena. However, you often have to read the CSV files without using Athena. In such cases, you can use a library such as OpenCSV to read them.
This example shows how to use OpenCSV to read S3 files directly, without downloading them first. This helps when you have no way to save files locally or if you don't have enough disk space. The solution is quite simple. You create an InputStream from an S3 object by calling the getObject method on the S3 client and taking the object's content stream. Once the input stream is available, you can wrap it in a CSVReader.
Assuming that the CSV files have a header row, you can use the CSVReaderHeaderAware class to build a list of hashmaps by reading each record in turn with the readMap method. When readMap returns null, you have reached the end of the file. Here is a complete solution for your reference.
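The original listing is not reproduced here, so what follows is only a minimal sketch of the approach described above, not the author's exact code. It assumes the AWS SDK for Java v1 (the one that provides AmazonS3ClientBuilder) and OpenCSV 5.x, where readMap also declares CsvValidationException. The class and method names S3CSVReader, getS3Records, getReader and getS3 come from the text. The readAll helper, the profile name "my-profile", the region "us-east-1", and the bucket and key in main are hypothetical placeholders.

```java
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;
import com.opencsv.CSVReaderHeaderAware;
import com.opencsv.exceptions.CsvValidationException;

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class S3CSVReader {

    // Builds an S3 client from a named credentials profile.
    // "my-profile" and "us-east-1" are placeholders; use your own values.
    private AmazonS3 getS3() {
        return AmazonS3ClientBuilder.standard()
                .withCredentials(new ProfileCredentialsProvider("my-profile"))
                .withRegion("us-east-1")
                .build();
    }

    // Opens the S3 object as a stream and wraps it in a header-aware reader.
    // Nothing is written to local disk.
    private CSVReaderHeaderAware getReader(String bucket, String key) throws IOException {
        S3Object object = getS3().getObject(bucket, key);
        Reader reader = new InputStreamReader(object.getObjectContent(), StandardCharsets.UTF_8);
        return new CSVReaderHeaderAware(reader);
    }

    // Reads every record of the given S3 object into a list of maps,
    // keyed by the column names from the header row.
    public List<Map<String, String>> getS3Records(String bucket, String key)
            throws IOException, CsvValidationException {
        try (CSVReaderHeaderAware reader = getReader(bucket, key)) {
            return readAll(reader);
        }
    }

    // Hypothetical helper: drains a reader until readMap() returns null (end of file).
    public static List<Map<String, String>> readAll(CSVReaderHeaderAware reader)
            throws IOException, CsvValidationException {
        List<Map<String, String>> records = new ArrayList<>();
        Map<String, String> row;
        while ((row = reader.readMap()) != null) {
            records.add(row);
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder bucket and key; replace with a real CSV object.
        new S3CSVReader()
                .getS3Records("my-bucket", "path/to/file.csv")
                .forEach(System.out::println);
    }
}
```

Keeping the read loop in its own method means it can be tried against any Reader, for example a StringReader, without touching S3. Closing the CSVReaderHeaderAware in the try-with-resources block also closes the underlying S3 stream.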
Running the above code prints all the records from the file. Sample output for the file I used is below.
For the sake of clarity, I have formatted the serialised objects below.
If you want to understand the code, here is the flowchart of the algorithm.
In this snippet, we have two helper methods: getReader, which creates a reader object that is aware of the header row, and getS3, which creates an S3 client. Please change the AWS credentials profile to the one you use on your computer. If you have configured a default profile, you can also use AmazonS3ClientBuilder.defaultClient() to create the S3 client.
If you want to create a reader for TSV files instead of CSV, you can build a parser object with a tab separator and pass it to the reader. You can also use any custom separator when building the parser.
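As a rough illustration of the parser swap, the sketch below builds a tab-separated parser with CSVParserBuilder and hands it to a header-aware reader through CSVReaderHeaderAwareBuilder. It assumes OpenCSV 5.x; the class name TsvReaderExample and the method tsvReader are made up for this example, and the input is an in-memory string rather than an S3 object so the snippet runs without AWS credentials.

```java
import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReaderHeaderAware;
import com.opencsv.CSVReaderHeaderAwareBuilder;

import java.io.Reader;
import java.io.StringReader;
import java.util.Map;

public class TsvReaderExample {

    // Wraps any Reader in a header-aware reader that splits fields on tabs.
    // Swap '\t' for another character (for example '|' or ';') for other formats.
    public static CSVReaderHeaderAware tsvReader(Reader source) {
        CSVParser parser = new CSVParserBuilder()
                .withSeparator('\t')
                .build();
        return new CSVReaderHeaderAwareBuilder(source)
                .withCSVParser(parser)
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Small in-memory sample: one header row and one record.
        String tsv = "id\tname\n1\tAnn\n";
        try (CSVReaderHeaderAware reader = tsvReader(new StringReader(tsv))) {
            Map<String, String> row = reader.readMap();
            System.out.println(row);
        }
    }
}
```

To read a TSV file from S3, the same tsvReader method could be given the InputStreamReader created from the S3 object's content stream instead of the StringReader.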
To use this code, create an object of the S3CSVReader class and invoke its getS3Records method, passing the S3 bucket name and the key path of the CSV file in S3. This method creates a reader object, iterates through all the records to build a List of HashMaps, and returns the result.
Dependencies
Assuming that you are using Maven, you need to add the following dependencies to your pom.xml to add opencsv to your project.
To import the dependencies with other build systems, such as Gradle, go to the corresponding Maven artifact page, such as https://mvnrepository.com/artifact/com.opencsv/opencsv, and select the appropriate tab.
Test Code
Assuming that you have configured the AWS profile correctly and provided the bucket name and S3 key, the following code should print all the records.