Categories

Versions

You are viewing the RapidMiner Server documentation for version 9.1 - Check here for latest version

PostgreSQL database size

RapidMiner Server 9.0.2 and above are not affected by this issue. If you have updated RapidMiner Server to the patched version and have previously manually added the database triggers you should remove the obsolete triggers.

If you're running RapidMiner Server 8.0 till 9.0.1 with PostgreSQL, the database might grow large. This relates to PostgreSQL's mechanism to handle large objects. Internally, PostgreSQL is using a separate Large Objects table which holds references to large entries in the database, but unused entries, which are not referenced anymore, are not removed automatically from this table.

This problem can be addressed with the following steps:

  1. Add a trigger function which will handle deleting unused references for future operations.
  2. Clean up old unused table references once. This operation will not reclaim used disk space by PostgreSQL immediately, but allocated disk space will first be used by newly created objects.
  3. Optionally run PostgreSQL's internal mechanism VACUUM to reclaim disk space used by the large objects table.

RapidMiner Server has to be shut down before starting the procedure outlined in this page.

It’s advised to do a full backup of your database schema after you shut down RapidMiner Server and before you start executing any of the statements mentioned in this page.

Please be aware that the following queries have to be executed as the same user and in the same database which are used by RapidMiner Server.

The following queries will lock tables in your database and prevent other operations from running at the same time.

The first step creates triggers in the database to automatically delete unused references every time specific tables are updated. To create it, please execute the following statement.

-- Trigger function that handles deletions and updates of table columns that contain large object references.
-- The old references are removed from the pg_largeobject table if the row is deleted or updated.
create or replace function rmserver_largeobject_cleanup_trgfunc () returns trigger as
$func$
begin
    -- Update triggers have NEW and OLD
    if TG_OP = 'UPDATE' then
        -- Depending on the table, unlink (delete) the old large object if the reference changed.
        if TG_TABLE_NAME = 'ra_ent_processexecparam' then 
            if new.processcontext <> old.processcontext then 
                perform lo_unlink(old.processcontext);
            end if;
        elseif TG_TABLE_NAME = 'ra_ent_bytebuffer' then
            if new.buffer <> old.buffer then
                perform lo_unlink(old.buffer);
            end if;
        elseif TG_TABLE_NAME = 'ra_ent_jobsubmissionfailure' then
            if new.process <> old.process then
                perform lo_unlink(old.process::bigint);
            end if;
        elseif TG_TABLE_NAME = 'ra_ent_jobsubmissionfailure' then
            if new.message <> old.message then
                perform lo_unlink(old.message::bigint);
            end if;
        elseif TG_TABLE_NAME = 'jobservice_job_error' then
            if new.message <> old.message then
                perform lo_unlink(old.message::bigint);
            end if;
        elseif TG_TABLE_NAME = 'jobservice_job_error' then
            if new.trace <> old.trace then
                perform lo_unlink(old.trace::bigint);
            end if;
        elseif TG_TABLE_NAME = 'jobservice_job_error' then
            if new.explanation <> old.explanation then
                perform lo_unlink(old.explanation::bigint);
            end if;
        elseif TG_TABLE_NAME = 'jobservice_job_log' then
            if coalesce(new.log, '') <> old.log then
                perform lo_unlink(old.log::bigint);
            end if;
        elseif TG_TABLE_NAME = 'jobservice_job' then
            if new.process <> old.process then
                perform lo_unlink(old.process::bigint);
            end if;
        end if;
    
        return NEW;
    elseif TG_OP = 'DELETE' then
        -- Depending on the table, unlink (delete) the referenced large object.
        if TG_TABLE_NAME = 'ra_ent_processexecparam' then 
            perform lo_unlink(old.processcontext);
        elseif TG_TABLE_NAME = 'ra_ent_bytebuffer' then
            perform lo_unlink(old.buffer);
        elseif TG_TABLE_NAME = 'ra_ent_jobsubmissionfailure' then
            perform lo_unlink(old.process::bigint);
        elseif TG_TABLE_NAME = 'ra_ent_jobsubmissionfailure' then
            perform lo_unlink(old.message::bigint);
        elseif TG_TABLE_NAME = 'jobservice_job_error' then
            perform lo_unlink(old.message::bigint);
        elseif TG_TABLE_NAME = 'jobservice_job_error' then
            perform lo_unlink(old.trace::bigint);
        elseif TG_TABLE_NAME = 'jobservice_job_error' then
            perform lo_unlink(old.explanation::bigint);
        elseif TG_TABLE_NAME = 'jobservice_job_log' then
            perform lo_unlink(old.log::bigint);
        elseif TG_TABLE_NAME = 'jobservice_job' then
            perform lo_unlink(old.process::bigint);
        end if;
        return OLD;
    end if;


end;
$func$
language plpgsql
volatile;

drop trigger if exists ra_ent_processexecparam_lob_trg on ra_ent_processexecparam;
create trigger ra_ent_processexecparam_lob_trg after update or delete on ra_ent_processexecparam
    for each row execute procedure rmserver_largeobject_cleanup_trgfunc();

drop trigger if exists ra_ent_bytebuffer_lob_trg on ra_ent_bytebuffer;
create trigger ra_ent_bytebuffer_lob_trg after update or delete on ra_ent_bytebuffer
    for each row execute procedure rmserver_largeobject_cleanup_trgfunc();

drop trigger if exists ra_ent_jobsubmissionfailure_lob_trg on ra_ent_jobsubmissionfailure;
create trigger ra_ent_jobsubmissionfailure_lob_trg after update or delete on ra_ent_jobsubmissionfailure
    for each row execute procedure rmserver_largeobject_cleanup_trgfunc();

drop trigger if exists jobservice_job_error_lob_trg on jobservice_job_error;
create trigger jobservice_job_error_lob_trg after update or delete on jobservice_job_error
    for each row execute procedure rmserver_largeobject_cleanup_trgfunc();

drop trigger if exists jobservice_job_log_lob_trg on jobservice_job_log;
create trigger jobservice_job_log_lob_trg after update or delete on jobservice_job_log
    for each row execute procedure rmserver_largeobject_cleanup_trgfunc();

drop trigger if exists jobservice_job_lob_trg on jobservice_job;
create trigger jobservice_job_lob_trg after update or delete on jobservice_job
    for each row execute procedure rmserver_largeobject_cleanup_trgfunc();

Before starting the clean up procedure, let's verify that your setup requires this clean up. Please execute the following statement. If the amount of existing large objects is low and the amount of large objects within the pg_largeobject table is much higher, you probably should do the clean up outlined in step 2.

This query works only on PostgreSQL 9.4 and newer. (Older releases are not supported anymore by the PostgreSQL project, you should upgrade.)

with oids as (
    select processcontext as loboid from ra_ent_processexecparam
    union
    select buffer from ra_ent_bytebuffer
),
texts as (
    select process as loboidt from ra_ent_jobsubmissionfailure
    union
    select message from ra_ent_jobsubmissionfailure
    union 
    select message from jobservice_job_error
    union
    select trace from jobservice_job_error
    union
    select explanation from jobservice_job_error
    union
    select log from jobservice_job_log
    union
    select process from jobservice_job
),
existing_oids as (
    select loboid
    from oids
    where loboid is not null
    union
    select loboidt::oid
    from texts
    where loboidt is not null
),
nonexisting_oids as (
    select lom.oid as nexoid
    from pg_largeobject_metadata lom
    left outer join existing_oids eo on eo.loboid = lom.oid
    where eo.loboid is null
),
stats as (
    select 'Existing large objects'::text as category,
        count(loboid) as objectcount,
        pg_size_pretty(sum(length(lo_get(lom.oid)))) as size
    from existing_oids eo
    inner join pg_largeobject_metadata lom on lom.oid = eo.loboid
    union
    select 'pg_largeobject table' as category,
        count(oid),
        pg_size_pretty(pg_relation_size('pg_largeobject'))
    from pg_largeobject_metadata lom
)
select *
from stats;

Please execute the following statement to free allocated table space by unused references. (We took great care to create a complete list of references that are searched in the pg_largeobject table. However, in special situations and if you used the RapidMiner backend database for other purposes, this might delete data that's still being referenced. You would see error messages like "Large Object 252974 doesn't exist" after starting Server again. You must restore your database backup in this case.)

with oids as (
    select processcontext as loboid from ra_ent_processexecparam
    union
    select buffer from ra_ent_bytebuffer
),
texts as (
    select process as loboidt from ra_ent_jobsubmissionfailure
    union
    select message from ra_ent_jobsubmissionfailure
    union 
    select message from jobservice_job_error
    union
    select trace from jobservice_job_error
    union
    select explanation from jobservice_job_error
    union
    select log from jobservice_job_log
    union
    select process from jobservice_job
),
existing_oids as (
    select loboid
    from oids
    where loboid is not null
    union
    select loboidt::oid
    from texts
    where loboidt is not null
),
nonexisting_oids as (
    select lom.oid as nexoid
    from pg_largeobject_metadata lom
    left outer join existing_oids eo on eo.loboid = lom.oid
    where eo.loboid is null
)
select nexoid as invalid_oid, lo_unlink(nexoid)
from nonexisting_oids;

As mentioned above, the clean up statement won't reclaim disk space already used by your PostgreSQL instance immediately. However, you can reclaim it. Please execute the following statement to do so.

Remove introduced triggers and function

RapidMiner Sever 9.0.2 fixes the described issue and as consequence the introduced triggers become obsolete and should be removed from the database. Please execute the following statement to remove the triggers and the corresponding function:

drop trigger if exists ra_ent_processexecparam_lob_trg on ra_ent_processexecparam;
drop trigger if exists ra_ent_bytebuffer_lob_trg on ra_ent_bytebuffer;
drop trigger if exists ra_ent_jobsubmissionfailure_lob_trg on ra_ent_jobsubmissionfailure;
drop trigger if exists jobservice_job_error_lob_trg on jobservice_job_error;
drop trigger if exists jobservice_job_log_lob_trg on jobservice_job_log;
drop trigger if exists jobservice_job_lob_trg on jobservice_job;

drop function if exists rmserver_largeobject_cleanup_trgfunc();