原文地址:http://www.baeldung.com/java-read-lines-large-file
1. Overview
This tutorial will show how to read all the lines from a large file in Java in an efficient manner.
This article is part of the “Java – Back to Basic” tutorial here on Baeldung.
2. Reading In Memory
The standard way of reading the lines of the file is in-memory – both Guava and Apache Commons IO provide a quick way to do just that:
1
|
Files.readLines( new File(path), Charsets.UTF_8); |
1
|
FileUtils.readLines( new File(path)); |
The problem with this approach is that all the file lines are kept in memory – which will quickly lead to OutOfMemoryError if the File is large enough.
For example – reading a ~1Gb file:
1
2
3
4
5
|
@Test public void givenUsingGuava_whenIteratingAFile_thenWorks() throws IOException { String path = ... Files.readLines( new File(path), Charsets.UTF_8); } |
This starts off with a small amount of memory being consumed: (~0 Mb consumed)
1
2
|
[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 128 Mb [main] INFO org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 116 Mb |
However, after the full file has been processed, we have at the end: (~2 Gb consumed)
1
2
|
[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 2666 Mb [main] INFO org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 490 Mb |
Which means that about 2.1 Gb of memory are consumed by the process – the reason is simple – the lines of the file are all being stored in memory now.
It should be obvious by this point that keeping in-memory the contents of the file will quickly exhaust the available memory – regardless of how much that actually is.
What’s more, we usually don’t need all of the lines in the file in memory at once – instead, we just need to be able to iterate through each one, do some processing and throw it away. So, this is exactly what we’re going to do – iterate through the lines without holding the in memory.
3. Streaming Through the File
Let’s now look at a solution – we’re going to use a java.util.Scanner to run through the contents of the file and retrieve lines serially, one by one:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
|
FileInputStream inputStream = null ; Scanner sc = null ; try { inputStream = new FileInputStream(path); sc = new Scanner(inputStream, "UTF-8" ); while (sc.hasNextLine()) { String line = sc.nextLine(); // System.out.println(line); } // note that Scanner suppresses exceptions if (sc.ioException() != null ) { throw sc.ioException(); } } finally { if (inputStream != null ) { inputStream.close(); } if (sc != null ) { sc.close(); } } |
This solution will iterate through all the lines in the file – allowing for processing of each line – without keeping references to them – and in conclusion, without keeping them in memory: (~150 Mb consumed)
1
2
|
[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 763 Mb [main] INFO org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 605 Mb |
4. Streaming with Apache Commons IO
The same can be achieved using the Commons IO library as well, by using the customLineIterator provided by the library:
1
2
3
4
5
6
7
8
9
|
LineIterator it = FileUtils.lineIterator(theFile, "UTF-8" ); try { while (it.hasNext()) { String line = it.nextLine(); // do something with line } } finally { LineIterator.closeQuietly(it); } |
Since the entire file is not fully in memory – this will also result in pretty conservative memory consumption numbers: (~150 Mb consumed)
1
2
|
[main] INFO o.b.java.CoreJavaIoIntegrationTest - Total Memory: 752 Mb [main] INFO o.b.java.CoreJavaIoIntegrationTest - Free Memory: 564 Mb |
5. Conclusion
This quick article shows how to process lines in a large file without iteratively, without exhausting the available memory – which proves quite useful when working with these large files.
The implementation of all these examples and code snippets can be found in my github project – this is an Eclipse based project, so it should be easy to import and run as it is.