Splunk ETL

Extract, Transform and Load data into Splunk

Before downloading, please read our terms and conditions

Splunk is the bee’s knees when it comes to analysing and visualising log file data. If you haven’t used it yet, I highly recommend it. I’ve been working with it for quite a while, but recently I’ve had a requirement to keep the license cost down whilst monitoring log files incrementally: license costs are based on how much data you index daily.

Historically I have just been overwriting the full log file locally every 10 minutes and having Splunk monitor that directory. But to keep the license costs down I want to filter the file down to just the required data first, before indexing. This will no doubt speed things up as well, since I typically only need around 10% of the data in the files.

Also, I have occasionally had an issue with Splunk double indexing requests. It’s not clear exactly why this occurs, but it seems to be affected by the way you copy over your log files. I have some workarounds below that reduce this problem.

Extract (grep for Windows)

To filter the data before indexing I follow a couple of steps. The first step is to copy the full log file to a local temp directory and grep for the lines I want to keep. I do have Unix utils and Cygwin, but grep from these wasn’t up to the job, so I’ve written a few VB scripts to do this:

not_grep_by_field_to_new_file.vbs.txt

grep_by_field_to_new_file.vbs.txt

grep_to_new_file.vbs.txt

The main thing these do, which was tricky with Unix grep on Windows, is read in a file of multiple grep terms. A typical command line is:

cscript "c:\vbs scripts\not_grep_by_field_to_new_file.vbs" temp.log c:\logs01\grep_list_for_project1.txt 7

(“This script RETRIEVES the lines from the first input file where the terms in the second input file are NOT found in the specified field”)
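
For reference, here is a minimal sketch of the sort of logic these scripts contain, assuming space-separated fields and the ‘grepped_temp.log’ output name used later on this page. It is an illustration only, not the downloadable script itself:

' Hypothetical sketch of not_grep_by_field_to_new_file.vbs (illustration only).
' Keeps the lines from the input file where the given space-separated field
' does NOT contain any of the terms listed in the terms file.
Option Explicit

Dim fso, args, inFile, termsFile, fieldNum, outFile
Set fso = CreateObject("Scripting.FileSystemObject")
Set args = WScript.Arguments

inFile    = args(0)              ' e.g. temp.log
termsFile = args(1)              ' e.g. grep_list_for_project1.txt
fieldNum  = CInt(args(2))        ' e.g. 7 (the URL field, counted from 1)
outFile   = "grepped_temp.log"   ' assumed output name, as used later on this page

' Read the list of terms to exclude
Dim terms(), termCount, ts, line
termCount = 0
Set ts = fso.OpenTextFile(termsFile, 1)
Do Until ts.AtEndOfStream
    line = Trim(ts.ReadLine)
    If Len(line) > 0 Then
        ReDim Preserve terms(termCount)
        terms(termCount) = line
        termCount = termCount + 1
    End If
Loop
ts.Close

' Copy across every line whose target field matches none of the terms
Dim inStream, outStream, fields, i, keep
Set inStream  = fso.OpenTextFile(inFile, 1)
Set outStream = fso.CreateTextFile(outFile, True)
Do Until inStream.AtEndOfStream
    line = inStream.ReadLine
    fields = Split(line, " ")
    keep = True
    If UBound(fields) >= fieldNum - 1 Then
        For i = 0 To termCount - 1
            If InStr(fields(fieldNum - 1), terms(i)) > 0 Then
                keep = False
                Exit For
            End If
        Next
    End If
    If keep Then outStream.WriteLine line
Loop
inStream.Close
outStream.Close

The matching grep_by_field_to_new_file.vbs presumably keeps the lines where the field does contain one of the terms, and grep_to_new_file.vbs works against the whole line rather than a single field.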

Transform and Load

(Not doing any data transformation in this case)

Loading the file is where we need to be a bit careful. This is where I have seen issues with double indexing data, and I have worked around most of them. I still have some issues and have made one last change; I need to wait a week to see if the issue has fully gone. Details below. [UPDATE: the process presented on this page is now fully working for me]

[Latest update: I now implement full local log file rollover, based on our local update period, e.g. every ten minutes]

  • I have Splunk constantly monitoring a directory on my local desktop for files called ‘access.log.*’
  • Every 10 minutes (using a local Jenkins) the full access log is downloaded to temp.log and grepped to ‘grepped_temp.log’ (see above)
  • We then use a VB script to compare the last grep result to the current grep result (a sketch of this logic is shown after the script link below)
    • if the current grep result has more lines, then we copy these new lines to a buffer (this step has been updated several times; appending, cp and mv have all caused me double indexing issues)
    • if the current grep result is the same as the last grep result, don’t do anything
    • if the current grep has fewer lines than the last grep AND the first lines in the files are different^, then assume log file rollover on the server and start anew locally.
  • If we have new lines (in the buffer, or a whole new file because of server log file rollover) then delete all the ‘access.log.*’ files locally and make a new rollover file called ‘access.log.datestamp’. This new file contains the new lines (either from the buffer or the whole new file).

rollover_log_file_date.vbs.txt

cscript "c:\backup\vbs scripts\rollover_log_file_date.vbs" grepped_temp.log access.log

[Some history to highlight lessons learnt:

First attempt: Previous versions of this script started by creating a new version of the local ‘access.log’ file in the loop around the grep results, writing each line in turn. I then changed to writing out to a buffer file and copying this over the existing local ‘access.log’ file. This seemed to clear up the double indexing in one project but not completely in another.

Second attempt: So I switched to appending, copying the process that naturally occurs on the server itself: log files are appended to until they reach a particular size or date. I am more hopeful that this will work. We still want the intermediary steps to grep for just the lines we want; the only question is how log file rollover is handled. NOTE: the latest change is to go even further, as described above, and actually implement full log file rollover locally. Another change had to be made in how log file rollover is detected. We now check two things: 1. the latest file is smaller than the previous file, AND 2.^ the first lines of data in the two versions are different: then we certainly must have started a new file. Without this last check, I saw issues on some servers in the early hours of Sunday morning, as log files did not seem to change over smoothly in one operation.

Third attempt: Back to copying over with a buffer file, but this time always creating a brand new time-stamped local log file so Splunk won’t be double indexing anything]

If this method works, I would even use it when monitoring complete log files (without the grep), because I have seen double indexing before with whole files that were simply copied over the local copy. This method of mimicking server behaviour by implementing full local log file rollover should avoid all these issues; Splunk should be designed to cope with this scenario.

Setting up Splunk to monitor access.log.*

[Screenshot: Splunk data input configured to monitor the local logs directory for ‘access.log.*’ files]
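
The screenshot shows the data input added through the Splunk GUI. For reference, a roughly equivalent inputs.conf stanza might look like the sketch below; the index and sourcetype values here are illustrative, and the monitored path is the one from the real world example further down:

[monitor://C:\logs01\environments\ec2\logs\app01.<project>-prod.<x>cloud.co.uk\mnt\www\logs\access.log.*]
disabled = false
index = main
sourcetype = access_combined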

Files generated on the local box as the process above is implemented

(Note, once you are confident with the process you can turn down the logging in the script and avoid getting all the copies of the ‘file_so_far’ shown below)

[Screenshot: files generated in the local logs directory by the process above, including copies of the ‘file_so_far’ log]

Real world example

Final batch file to get data from a log server and prepare it for Splunk to index. This job is called by Jenkins once every 10 minutes throughout the day:

rem //<project> PROD boxes

"C:\Program Files (x86)\PuTTY\pscp.exe" -p -v -l <username> -pw <password> 192.168.11.122:/environments/ec2/logs/app01.<project>-prod.<x>cloud.co.uk/mnt/www/logs/access.log C:\logs01\environments\ec2\logs\app01.<project>-prod.<x>cloud.co.uk\mnt\www\logs\temp.log

cd C:\logs01\environments\ec2\logs\app01.<project>-prod.<x>cloud.co.uk\mnt\www\logs

cscript "c:\backup\vbs scripts\not_grep_by_field_to_new_file.vbs" temp.log c:\logs01\grep_list_for_<project>.txt 7

cscript "c:\backup\vbs scripts\rollover_log_file_date.vbs" grepped_temp.log access.log

Information relating to the above

grep_list_for_<project>.txt:
.jpg

Typical log line (field 7 is the URL):
xx.yy.zz.aa - - [06/Mar/2014:13:49:12 +0000] "GET /<project_code>/<element> HTTP/1.1" 200 159534 "http://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=<value>&url=http%3A%2F%2Fwww.<site>.com%2F<project_code>%2F<element>&ei=<value2>&usg=<value3>&sig2=<value3>" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/7.0)" - 0.073
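
Splitting that line on spaces makes the request path the seventh token, which is why field 7 is passed to the grep script above. A quick hypothetical check in VBScript:

' Quick check that the URL is space-separated field 7 (0-based index 6)
Dim sample, fields
sample = "xx.yy.zz.aa - - [06/Mar/2014:13:49:12 +0000] ""GET /<project_code>/<element> HTTP/1.1"" 200 159534"
fields = Split(sample, " ")
WScript.Echo fields(6)   ' prints /<project_code>/<element>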

How to spot double indexing

Since I’m sending reports up to managers, I really need to be sure that the numbers are correct, so Splunk double indexing is a serious concern. It can potentially double the load reported and could lead to wrong decisions about server configuration and business performance requirements. And since it’s variable, it can be difficult to spot. I’m trying to fix this (again) after an unexpected load seen last Sunday: I only became curious because we were not expecting increased load on the Sunday versus the Saturday. This turned out to be double indexing rearing its head again.

The first thing to watch for is unexpected increases in load, such as I saw that Sunday. To track this down directly, zoom in on a second and look for duplicate entries in the Splunk GUI.

Then click on ‘view source’ and typically you can see that the source does not match the report. For example, the count of calls in any given second or microsecond may be a multiple of what is seen in the source file. I have typically seen calls doubled up, not always in strict order in the Splunk GUI report, and occasionally 3 or 4 times over.

If it is not directly obvious from the above, do a manual grep on the actual source file over a tight time frame. Be sure your methodology is robust and your reports give correct numbers over a full business period (1 or 2 weeks) before relying on them.
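
For the manual check, something as simple as pulling out one second from the local copy of the full log and counting the matching lines is enough to compare against what the Splunk report shows for that second (the timestamp here is just the one from the example line above):

findstr /C:"06/Mar/2014:13:49:12" temp.log | find /C /V ""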
