Andrei Posted December 3

Last week I started working with Erik to get the Project Minnesota documents into their own instance of Apario Reader. The project was originally built for the JFK Files and then expanded for the STAR GATE Files, and as more record sets get ingested into the software, corner cases like this one are going to show up. This is what is called a DATA RACE: two or more concurrent processes write to the same resource at the same time, and that's a problem because only one write can safely happen at a time, not two. When you have 2+, you have a data race that requires you to refactor your code. These kinds of problems can take hundreds of hours to debug depending on how deep they are in the code. Fortunately, this bug is easy to fix.

```
WARNING: DATA RACE
Write at 0x00c0069680c0 by goroutine 546129:
  runtime.mapassign_fast64()
      /home/andrei/go/versions/1.23.1/go/src/runtime/map_fast64.go:113 +0x0
  main.aggregatePendingPage()
      /home/andrei/work/projectapario/apario-writer/aggregator.go:72 +0x4a6
  main.receiveCompletedPendingPage.gowrap1()
      /home/andrei/work/projectapario/apario-writer/receivers.go:256 +0xf3

Previous write at 0x00c0069680c0 by goroutine 546130:
  runtime.mapassign_fast64()
      /home/andrei/go/versions/1.23.1/go/src/runtime/map_fast64.go:113 +0x0
  main.aggregatePendingPage()
      /home/andrei/work/projectapario/apario-writer/aggregator.go:72 +0x4a6
  main.receiveCompletedPendingPage.gowrap1()
      /home/andrei/work/projectapario/apario-writer/receivers.go:256 +0xf3

Goroutine 546129 (running) created at:
  main.receiveCompletedPendingPage()
      /home/andrei/work/projectapario/apario-writer/receivers.go:256 +0x31c
  main.main.gowrap12()
      /home/andrei/work/projectapario/apario-writer/apario-writer.go:176 +0x4f

Goroutine 546130 (finished) created at:
  main.receiveCompletedPendingPage()
      /home/andrei/work/projectapario/apario-writer/receivers.go:256 +0x31c
  main.main.gowrap12()
      /home/andrei/work/projectapario/apario-writer/apario-writer.go:176 +0x4f
```

So the DATA RACE is occurring at the highlighted segment int64(pp.PageNumber). However, that is actually not where the bug is. It's here...

Why is that the bug? Well, let me explain. aggregatePendingPage is invoked this way...

Now do you see where the concurrency is coming into play? Well, if you don't, fear not, you'll understand soon enough. This function is a receiver function that starts a for / select infinite loop that blocks the goroutine it's running on, waiting either for the ctx (its context) to complete, or for the ch channel to receive new data as pp, ok := ipp.(PendingPage). The go keyword before the aggregatePendingPage call spawns a new goroutine for each page to be processed. This means that if 1 document has 3 pages, 3 goroutines will attempt to write the page data into the parent document. Talk about the bug being at the very end of the program... literally, we spent 24 hours compiling the data, everything compiled, but the final document.json file didn't come out correctly because of the DATA RACE, so we need to address this major concern.

Going back to the code, we have this line:

Ultimately, we need to look further up in the file, specifically at this line:

The sm_documents variable is a sync.Map data type that uses a key -> value pattern to store concurrently accessed data as the value, accessible by a string key. In this case, I am using the Document ID (RecordIdentifier) as the string key of the sm_documents Map.
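To see the failure mode in isolation, here is a minimal sketch of the pattern the race detector is flagging above: multiple goroutines, one per page, assigning into the same unguarded map. This is not apario-writer's actual code; the names pages and the pared-down PendingPage are hypothetical stand-ins.

```go
package main

import "sync"

// PendingPage is a hypothetical stand-in for the real struct.
type PendingPage struct {
	PageNumber int
	Identifier string
}

func main() {
	pages := map[int64]string{} // shared map with no lock guarding it
	var wg sync.WaitGroup

	// One goroutine per page, just like `go aggregatePendingPage(...)`
	// being spawned per completed page. `go run -race .` reports the
	// data race; the bare runtime may also abort with the fatal error
	// "concurrent map writes".
	for i := 1; i <= 3; i++ {
		wg.Add(1)
		go func(pp PendingPage) {
			defer wg.Done()
			// DATA RACE: an unsynchronized map write, analogous to
			// the int64(pp.PageNumber) assignment in aggregator.go.
			pages[int64(pp.PageNumber)] = pp.Identifier
		}(PendingPage{PageNumber: i, Identifier: "page"})
	}
	wg.Wait()
}
```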
To access the underlying data safely, then, we need to ensure that the mutex on the document actually exists and can be used across the multiple goroutines running the aggregatePendingPage func. Therefore, let's dig further back... The sm_documents Map is used 4 times throughout the apario-writer application: twice in processor.go, once in aggregator.go and once in record.go. We need to determine what each of these writes is doing specifically and how each relates to the *sync.Mutex not being properly set in the underlying document stored inside the sm_documents Map. Therefore, let's dig into the 3 .Store( uses of the sm_documents Map.

This first usage of .Store( on the sm_documents Map relates to ingesting a PDF document over a URL, which requires the file to be downloaded before it can be processed. The bug is actually here. But it's in 3 locations total. We need to fix all 3 locations. Can you see the bug? I do. Let's define the &sync.Mutex{} here, instead of doing this...

This means that for URL imports, line 63 in this block of code will never run, as document.mu will never be nil now. Let's do the remaining 2 locations next. Again, we can see that mu is not defined here, so we need to modify the code by adding the initialization after line 459. Now let's do the last one... That adds a new line after line 267 to declare the &sync.Mutex{} within the document. In this case, the string key is defined by the variable identifier, but the underlying data being stored in the Map is a Document{}, and given that the property mu was undefined before, it was causing issues with reliably locking/unlocking the mutex.

Now let's go back to the problem segment of code... Notice how on line 66 we perform document.mu.Lock() before modifying the Map data types within the Document itself. Before, however, the mu was not properly defined, and the code was trying to define it from within the concurrent call to go aggregatePendingPage() here... Now that we've moved the declaration of the Document's mu property to the .Store( locations of the sm_documents Map, we could safely remove these lines... We're not required to, though. Instead, since we modified how the &sync.Mutex{} is being initialized within the document itself, we should return an error here if the Mutex is not defined. A nil mutex at this point is a fatal problem that you want to capture instantly.
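Here is a minimal sketch of the shape of the fix, assuming hypothetical field and function names rather than apario-writer's real ones: the &sync.Mutex{} is initialized at the .Store( site, before the Document is ever visible to another goroutine, and the aggregation path treats a nil mutex as a hard error instead of lazily creating one.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Document is a hypothetical stand-in for apario-writer's real struct.
type Document struct {
	RecordIdentifier string
	Pages            map[int64]string
	mu               *sync.Mutex
}

var sm_documents sync.Map

// The fix: the mutex is created once, at the .Store( site, before the
// Document is published to other goroutines.
func storeDocument(identifier string) {
	sm_documents.Store(identifier, Document{
		RecordIdentifier: identifier,
		Pages:            map[int64]string{},
		mu:               &sync.Mutex{}, // initialized here, never lazily
	})
}

// aggregatePendingPage-style access: lazily assigning mu from several
// goroutines was itself a data race, so a nil mu is now a fatal error.
func aggregate(identifier string, pageNumber int64, body string) error {
	value, ok := sm_documents.Load(identifier)
	if !ok {
		return errors.New("no document stored for identifier " + identifier)
	}
	document := value.(Document)
	if document.mu == nil {
		return errors.New("document mutex is nil for " + identifier)
	}
	document.mu.Lock()
	defer document.mu.Unlock()
	document.Pages[pageNumber] = body // safe: guarded by the shared *sync.Mutex
	return nil
}

func main() {
	storeDocument("doc-001")
	var wg sync.WaitGroup
	for i := int64(1); i <= 3; i++ { // one goroutine per page, as before
		wg.Add(1)
		go func(page int64) {
			defer wg.Done()
			if err := aggregate("doc-001", page, "page body"); err != nil {
				fmt.Println(err)
			}
		}(i)
	}
	wg.Wait()
}
```

Because the mutex pointer is set before the value is published through the sync.Map, every goroutine that Loads the Document shares the same *sync.Mutex, which is exactly what the lazy, racy initialization could not guarantee.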
And voila! The bug is fixed. Maybe. Now I need to test it. I'll submit this as is and let you mull on it, and when I determine whether the bug has been fixed, I'll submit the change to GitHub for everyone else who is being impacted by this bug. I cannot stress enough the importance of using these forums on America's Comeback to report bugs and problems that you have with Apario Reader and Apario Writer, in the forums designated accordingly. When you run into issues like this, tell me and I'll be able to go through the issue just like this and bring you to a solution. I am the architect of this software and I singlehandedly built for the American public an $8M piece of Open Source Software called Project Apario. It was Yeshua Ha'Maschiah who asked me to work on it and serve in that capacity, using the gifts and talents that He gave me to serve others selflessly where they need it most. Society needs a biblical table-flip on the idea of the mainstream media, because it's been controlled for decades by racists who truly hate the American people. Why continue to give them your attention, and thus your mind, in exchange for the money they earn from Satan for brainwashing you in the process, when you could bypass the fake news media altogether and just learn the truth about Martin Luther King directly from the JFK Assassination Files... because the racist Talmudic commies who infiltrated our Government want a race war between white and black Americans, and that's bullshit. Instead, see for yourself the facts that Anderson Cooper is paid millions not to say out loud. Truly...

However, this bug appears to be squashed now. At first I thought it was a simple matter of moving the mutex, but then when I ran the Apario Writer on the Project Minnesota documents that Erik provided, it revealed this bug. This bug has been fixed but is currently untested. I am going to recompile the Apario Writer, resubmit it to the Apario Network, and then re-run the writer to finish the ingestion of the few thousand pages of data that he wants to release. It'll take me another 17 or so hours to finish the compilation and see if the bug is fixed or not. If it's fixed, I'll post an update in this thread. If it's not fixed, I'll post an update in the thread with the new error and the next patch to this bug.
Andrei (Author) Posted December 16

It's been a few days since I've provided an update on this bug, and I wanted to keep you apprised of the findings. So, I thought that the lock was the issue and that the data race was the actual bug. It wasn't. Yes, the error was a data race, but the root cause had absolutely NOTHING to do with this file, this function, or even the files that were being processed at the time. No, the bug was much simpler than that, but much trickier to find too. Sometimes bugs are not simply at the line number where the syntax or runtime issue occurred; rather, bugs can be structural, wrapped around perfectly valid syntax that makes the runtime behave incorrectly. Given the amount of time that I spent debugging the original problem and the "this isn't it" solution to it, it turned out that in order to find the bug surfacing at the very end of the code, I needed to start at the very beginning of the program. I'll spare you the long story.

The issue was that the pdfcpu binary was updated, and the output of the gs 1.7-optimized PDF was incompatible with files created using Pages from Apple. When the output of the pdfcpu binary changed, apario-writer failed to pick up that the property representing the total page count of the PDF had been renamed from .Pages to .PageCount. The aggregator func was looking at the total pages value to determine whether the document had finished processing. The pdfcpu binary update added stability to the PDF ingestion process, but it also gave me the opportunity to simplify the runtime of the program. This meant that I needed to completely replace the validatePdf(ctx context.Context, record ResultData) (ResultData, error) func with the newly rewritten validate_result_data_record(ctx context.Context, record ResultData) (ResultData, error). Introduced in this func were log_info, log_error, and the as-yet-unseen log_debug. I introduced 2 new types of Logger methods, and then a blend of the two as the third new func, alongside their formatted f equivalents. Implementing this new logger package resulted in low-level functions that looked like this:

Getting refactored to look like this:

In the event that one of the err assignments is not nil, it is logged through log_error, the *CustomLogger that was defined in main(). The sem_download Semaphore is used to protect the runtime of the appliance, and any issues are instantly logged into the error.log with a stack trace of the error when it's received. The stack trace looks like this:

Therefore, by using the CustomLogger throughout the entire code base of the apario-writer application, I was able to identify that the issue Erik and I were running into with importing Project Minnesota into the Apario Reader came down to the nuances of the proprietary vs. open source PDF analysis and modification tools available. Since all of Project Apario's work is GPL-3 Open Source, we're using open source libraries for our PDF analysis, and documents created with word processors like Apple's Pages or LibreOffice can cause issues with the underlying PDFs; that's why GhostScript is utilized. Previously, in the validatePdf(ctx context.Context, record ResultData) (ResultData, error) func, the gs binary from GhostScript was used to convert the PDF into compatibility-level 1.7. I thought this was going to be sufficient, and while it was for the JFK files and the STAR GATE files, it was not for the Project Minnesota files. Therefore, I needed to fix that bug.
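For reference, a conversion like the one described, shelling out to gs to rewrite a PDF at a fixed compatibility level, looks roughly like this. This is a sketch built on standard Ghostscript flags, not the actual apario-writer code; the convertPdf helper and its paths are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// convertPdf shells out to GhostScript to rewrite inputPath as a PDF at the
// given compatibility level. Illustrative sketch; error handling simplified.
func convertPdf(ctx context.Context, inputPath, outputPath, level string) error {
	cmd := exec.CommandContext(ctx, "gs",
		"-sDEVICE=pdfwrite",               // rewrite the input as a PDF
		"-dCompatibilityLevel="+level,     // target PDF version
		"-dNOPAUSE", "-dBATCH", "-dQUIET", // non-interactive run
		"-sOutputFile="+outputPath,
		inputPath,
	)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("gs failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Dropping the level from "1.7" to "1.5" is the adjustment described next.
	if err := convertPdf(context.Background(), "in.pdf", "out.pdf", "1.5"); err != nil {
		fmt.Println(err)
	}
}
```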
The original code invoked gs with options along those lines, at compatibility level 1.7, and those options simply didn't cut it anymore; I needed to adjust the code a little more to make it work with these files for Erik. Dropping the -dCompatibilityLevel from 1.7 to 1.5 meant that pdfcpu info -json PDF-Path wasn't going to throw errors during analysis, and that when the PDF is converted into a PNG, the underlying PNG file won't be invalid and throw errors relating to image.NRGBA. Before finding the pdfcpu Pages vs PageCount property change, I was debugging the issues that were throwing the image.NRGBA errors; I was able to find and debug them because of the log_error.TraceReturn calls used throughout this code. That way, when things caused issues, I knew the trace of the calls down to main where the error occurred. This is key when dealing with concurrent programs; you need to understand where and how the error happened and get a snapshot of the stack at the time the error is captured. This comprehensive function provides additional stability and debugging insight into the PNG-related errors that can occur with PDF files.

Ultimately, I was on a wild goose chase at this point, as the issue was an easy fix... but at the time, when I was debugging this, I couldn't see that the output of pdfcpu had changed, thus breaking the code. However, as more of the log_error updates were persisted through the project, the overall quality of the project increased, as did its debuggability. Now that these changes have been integrated into the apario-writer project, I am going to pivot over to the apario-reader app and provide a similar kind of enhancement.

Ultimately, this bug has been resolved. It's been squashed alongside the other termites that were lurking in the shadows around it. The log_error stuff really leveled me up in terms of slaying these bugs. Frankly, they were oversights on my part and I should have put the time in sooner to build the CustomLogger, but c'est la vie, right? Can't have it perfect always, or can we?